Systems and methods for correcting error in biological response signal data

ABSTRACT

A method for fluorophore bias removal in microarray experiments in which the fluorophores used in microarray experiment pairs are reversed. Further, a method for calculating the individual errors associated with each measurement made in nominally repeated microarray experiments. This error measurement is optionally coupled with rank based methods in order to determine a probability that a cellular constituent is up or down regulated in response to a perturbation. Finally, a method for determining the confidence in the weighted average of the expression level of a cellular constituent in nominally repeated microarray experiments.

1 FIELD OF THE INVENTION

The field of this invention relates to methods for using data frommultiple repeated experiments to generate a confidence value for eachdata point, increase sensitivity, and eliminate systematic experimentalbias.

2 BACKGROUND OF THE INVENTION

2.1 Quantitative Measurement of Cellular Constituents

There is currently an explosive increase in the generation ofquantitative measurements of the levels of “cellular constituents”.Cellular constituents include gene expression levels, abundance of mRNAencoding specific genes, and protein expression levels in a biologicalsystem. Levels of various constituents of a cell, such as mRNA encodinggenes and/or protein expression levels, are known to change in responseto drug treatments and other perturbations of the cell's biologicalstate. Measurements of a plurality of such “cellular constituents”therefore contain a wealth of information about the affect ofperturbations on the cell's biological state. The collection of suchmeasurements is generally referred to as the “profile” of the cell'sbiological state.

There may be on the order of 100,000 different cellular constituents formammalian cells. Consequently, the profile of a particular cell istypically complex. The profile of any given state of a biological systemis often measured after the biological system has been subjected to aperturbation. Such perturbations include experimental or environmentalconditions(s) associated with a biological system such as exposure ofthe system to a drug candidate, the introduction of an exogenous gene,the deletion of a gene from the system, or changes in cultureconditions. Comprehensive measurements of cellular constituents, orprofiles of gene and protein expression and their response toperturbations in the cell, therefore have a wide range of utilityincluding the ability to compare and understand the effects of drugs,diagnose disease, and optimize patient drug regimens. In addition, theyhave further application in basic life science research.

Within the past decade, several technological advances have made itpossible to accurately measure cellular constituents and thereforederive profiles. For example, new techniques provide the ability tomonitor the expression level of a large number of transcripts at any onetime (see, e.g., Schena et al., 1995, Quantitative monitoring of geneexpression patterns with a complementary DNA micro-array, Science270:467-470; Lockhart et al., 1996, Expression monitoring byhybridization to high-density oligonucleotide arrays, NatureBiotechnology 14:1675-1680; Blanchard et al., 1996, Sequence to array:Probing the genome's secrets, Nature Biotechnology 14, 1649; U.S. Pat.No. 5,569,588, issued Oct. 29, 1996 to Ashby et al. entitled “Methodsfor Drug Screening”). In organisms for which the complete genome isknown, it is possible to analyze the transcripts of all genes within thecell. With other organisms, such as humans, for which there is anincreasing knowledge of the genome, it is possible to simultaneouslymonitor large numbers of the genes within the cell.

In another front, the direct measurement of protein abundance has beenimproved by the use of microcolumn reversed-phase liquid chromatographyelectrospray ionization tandem mass spectrometry (LC/MS/MS) to directlyidentify proteins contained in mixtures. This technology promises topush the dynamic range for which protein abundance can be measured in abiological system. Using LC/MS/MS, McCormack et al. have demonstratedthat proteins presented in system mixtures can be readily identifiedwith a 30-fold difference in molar quantity, that the identificationsare reproducible, and that proteins within the mixture can be identifiedat low femtomole levels. McCormack et al., 1997, Direct analysis andidentification of proteins in mixtures by LC/MS/MS and databasesearching at the low-femtomole level, Anal. Chem. 69:767-776. In areview of tandem mass spectrometry, Chait points out that an additionaladvantage of this technology is that it is orders of magnitude fasterthan more conventional approaches such as Edman sequencing. Chait, 1996,Trawling for proteins in the post-genome era, Nat. Biotech, 14:1544.

Other technological advances have provided for the ability tospecifically perturb biological systems with individual geneticmutations. For example, Mortensen et al. describe a method for producingembryonic stem (ES) cell lines whereby both alleles are inactivated byhomologous recombination. Using the methods of Mortensen et al, it ispossible to obtain homozygous mutationally altered cells, i.e., doubleknockouts of ES cell lines. Mortensen et al. propose that their methodmay be generally applicable to other genes and to cell lines other thanES cells. Mortensen et al 1992, Production of homozygous mutant ES cellswith a single targeting construct, Cell Biol. 12:2391-2395.

In another promising technology Wach et al. provide a dominantresistance module for selection of S. cerevisiae transformants whichentirely consists of heterologous DNA. The module can also be used toprovide PCR based gene disruptions. Wach et al., 1994, New heterologousmodules for classical or PCR-based gene disruptions in Saccharomycescerevisiae, Yeast 10:1793-808.

Technological advances, such as the use of microarrays, are alreadybeing used in drug discovery (See e.g. Marton et al., 1998, Drug targetvalidation and identification of secondary drug target effects usingMicroarrays, Nature Medicine in press; Gray et al., 1998, Exploitingchemical libraries, structure, and genomics in the search for kinaseinhibitors, Science 281:533-538).

Comparison of profiles with other profiles in a database (see, e.g.,U.S. Pat. No. 5,777,888, issued Jul. 7, 1998 to Rine et al. entitled“Systems for generating and analyzing stimulus-response output signalmatrices”) or clustering of profiles by similarity can give clues to themolecular targets of drugs and related functions, efficacy and toxicityof drug candidates and/or pharmacological agents. Such comparisons mayalso be used to derive consensus profiles representative of ideal drugactivities or disease states. Profile comparison can also help detectdiseases in a patient at an early stage and provide improved clinicaloutcome projections for a patient diagnosed with a disease.

2.2 Fluorophore Bias

The use of two fluorophores has been described by Shalon et al. Shalonet al, 1996, A microarray system for analyzing complex DNA samples usingtwo-color fluorescent probe hybridization, Genome Research 6:629-645.The problem with the approach put forth by Shalon is that each speciesof mRNA molecule has a bias in its measured color ratio due tointeraction of the fluorescent labeling molecule with either the reversetranscription of the mRNA or with the hybridization efficiency or both.Without any error correction scheme to account for this bias, the datafrom a single microarray experiment, or even a plurality of nominalrepeats of a microarray experiment in which the various results areaverages, will produce an unacceptable error rate. As used herein, theterm nominal repeat or nominally repeated experiment refers toexperiments that are run under essentially the same or similarexperimental conditions such that it would be useful to combine theresults of the repeated experiments.

2.3 Inherent Error Rates of Cellular Constituent QuantitativeMeasurement Experiments

While the technological advances have allowed for the generation ofquantitative measurements of the levels cellular constituents, theexperiments are expensive. A single microarray experiment, or a singlegel electrophoresis place, can cost in the neighborhood of $100-$1000and higher. Also, it has only become apparent after many initialattempts to apply the data to actual commercial needs that individualexperiments suffer from high levels of false positives in the sense ofdeclaring significance where there really is none. Because of theexpense involved, and the high rate of false positives, no descriptionof robust methods for repeating and statistically combining multiple,nominally identical experiments for the express purpose of data qualityimprovement have been provided in the prior art.

The power of genome-wide cell profiling accomplished with microarrays isin its ability to survey response to known perturbations acrossessentially the entire set of cellular mechanisms. However, in any givenexperiment, typically only a small number of cellular constituents mayhave dramatic changes in abundance, where the vast majority areunchanged. There are exceptions, but cells have specific, biologicallyfairly insulated responses to stimuli, and so most profiles involve alarge set of constituents with ‘no-change’, and a much smaller set thatare either up or down regulated. For this reason, even a small falsealarm rate in the measurements can severely compromise their utility.For example, if one percent of cellular constituents actually respond ina typical experiment, the resolution in the measurement is twofold, andthe errors exceed twofold one percent of the time, then there will be asmany false alarms as true detections above a twofold threshold.

In general, the art has underappreciated the extensive amount of errorsthat are present in individual cellular constituent quantificationexperiments such as microarray or protein gel experiments. In additionto the difficulty posed by the fairly insulated response biologicalsystems have to any given perturbation, a substantial amount of error ispresent in any nominal microarray experiment due to artifacts such asunevenly printed DNA probe spots on the microarray, scratches dust andartifacts on the microarray; uneveness in signal brightness across themicroarray due to nonuniform DNA hybridization, and color stripes duefluorophore-specific biases of fluorophores used in the microarrayprocess.

One method to reduce the effects of these serious errors is to repeatthe experiment under identical conditions and to average the data.However, simple averaging of the data without any consideration of thenature of the underlying experimental errors does not provide anadequate solution to the problems the experimental errors introduce. Ifonly simple averaging of the data is performed, an excessive number ofnominal repeats would be required in order to reduce the effects oferror down to an acceptable level. However, because of the expenseinvolved in performing each cellular constituent quantificationexperiment, this is not a feasible solution. Accordingly, what is neededin the art are robust methods for combining the experimental results ofrepeated cellular constituent quantification experiments so that aminimal set of nominal repeats can provide an acceptable error rate.

Discussion or citation of a reference herein shall not be construed asan admission that such citation is prior art to the present invention.

3 SUMMARY OF THE INVENTION

This invention provides solutions for minimizing the number of times acellular constituent quantification experiment must be repeated in orderto produce data that has acceptable error levels. Accordingly, themethods of the present invention provide a novel method for fluorophorebias removal. This allows for the attenuation of fluorophore specificbiases to acceptable levels based on only two nominal repeats of acellular constituent quantification experiment. The present inventionfurther provides methods for combining nomimal repeats of a cellularconstituent quantification experiment based on rank order ofup-regulation or down-regulation. In these methods, cellular constituentup- or down-regulation data determined from nominal repeats of cellularconstituent quantification experiments are expressed by a novel metricthat is free of intensity dependent errors. Application of this metricbefore combining based on rank order provides a powerful method forremoving error from weakly expressing cellular constituents without anexcessive number of nominal repetitions of the expensive cellularconstituent quantification experiment.

Another aspect of the present invention is an improved method forcomputing a weighted average of individual cellular constituentmeasurements in nominally repeated cellular constituent quantificationexperiments. In particular, a novel method for calculating the errorassociated with each cellular constituent measurement is provided. Byusing this novel method for calculating error, the error bar in theweighted average is sharply attenuated. One skilled in the art willappreciate that these improved methods for computing a weighted averageare applicable to two-fluorophore (two-color) or single fluorophore(one-color) protocols.

One embodiment of the present invention provides a method of fluorophorebias removal comprising the steps of:

-   (a) labeling a first pool of genetic matter, derived from a    biological system representing a baseline state, with a first    fluorophore to obtain a first pool of fluorophore-labeled genetic    matter,-   (b) labeling a second pool of genetic matter, derived from a    biological system representing a perturbed state, with a second    fluorophore to obtain a second pool of fluorophore-labeled genetic    matter;-   (c) labeling a third pool of genetic matter, derived from said    biological system representing said baseline state, with said second    fluorophore to obtain a third pool of fluorophore-labeled genetic    matter;-   (d) labeling a fourth pool of genetic matter, derived from said    biological system representing said perturbed state, with said first    fluorophore to obtain a fourth pool of fluorophore-labeled genetic    matter;-   (e) independently contacting said first pool of fluorophore-labeled    genetic matter and said second pool of fluorophore-labeled genetic    matter with a first microarray under conditions such that    hybridization can occur and determining a first color ratio between    said first pool of fluorophore-labeled genetic matter and said    second pool of fluorophore-labeled genetic matter that binds to said    microarray;-   (f) independently contacting said third pool of fluorophore-labeled    genetic matter and said fourth pool of fluorophore-labeled genetic    matter with a second microarray under conditions such that    hybridization can occur and determining a second color ratio between    said third pool of fluorophore-labeled genetic matter and said    fourth pool of fluorophore-labeled genetic matter, and-   (g) computing an average color ratio by averaging said first color    ratio and said second color ratio.

Another embodiment of the invention provides a method for determining aprobability that an expression level of a cellular constituent in aplurality of paired differential microarray experiments is altered by aperturbation, wherein each paired differential microarray experiment insaid plurality of paired differential microarray experiments comprises afirst microarray experiment representing a baseline state of a firstbiological system, and a second microarray experiment representing aperturbed state of said first biological system, said method comprisingthe steps of

-   (a) determining an error distribution statistic by fitting a    reference pair of microarray experiments with an intensity    independent statistic, wherein said reference pair of microarray    experiments comprises a first reference microarray experiment, and a    second reference microarray experiment that is a nominal repeat of    said first reference microarray experiment;-   (b) selecting said cellular constituent from a get of cellular    constituents measured in said plurality of paired differential    microarray experiments, and, for each paired differential microarray    experiment in said plurality of paired differential microarray    experiments, determining an amount of change in expression level of    said cellular constituent between the second microarray experiment    and the first microarray experiment of said paired differential    microarray experiment using said error distribution statistic; and-   (c) determining said probability that said expression level of said    cellular constituent in said plurality of paired differential    microarray experiments is altered by said perturbation by combining    said amount of change in expression level of said cellular    constituent determined in step (b) for each paired differential    microarray experiment in said plurality of paired differential    microarray experiments using a rank based method.

Yet another embodiment of the invention is a method for determining aweighted mean differential intensity in an expression level of acellular constituent in a biological system in response to aperturbation, the method comprising:

-   -   (a) determining an error distribution statistic by fitting a        reference microarray experiment pair with an intensity        independent statistic, wherein the reference microarray        experiment pair comprises a first reference microarray        experiment and a second reference microarray experiment which is        a nominal repeat of the first reference microarray experiment;    -   (b) determining an amount of differential expression of the        cellular constituent a plurality of times;    -   (c) for each amount of differential expression determined in        accordance with (b), calculating a corresponding amount of error        based on a magnitude derived by the error distribution        statistic; and    -   (d) computing the weighted mean differential intensity by        inversely weighting each amount of the differential expression        of the cellular constituent determined in step (b) by the        corresponding amount of error determined in step (c) according        to the formula        $x = \frac{\sum\left( {x_{i}/\sigma_{i}^{2}} \right)}{\sum\left( {1/\sigma_{i}^{2}} \right)}$        where x is the weighted mean differential intensity of the        cellular constituent, x_(i) is a differential expression        measurement of the cellular constituent i and σ_(i) ² is a        corresponding error for mean differential intensity x_(i).

4 BRIEF DESCRIPTION OF THE FIGURES

FIG. 1 depicts some sources of measurement error present in microarrayfluorescent images. (A) depicts unevenly printed DNA probe spots. (B)depicts the effects of scratches, dust, and artifacts. (C) depicts howspot positions drift away from a nominal measuring grid. (D) depicts theeffects of unevenness in the brightness across the microarray due touneven hybridization. (E) depicts the effects of color stripes on themicroarray due to fluorophore-specific biases.

FIG. 2 illustrates the effect of deleting genes responsible for theproduction of calcineurin protein in the yeast S. Cerevisiae (CNA1 andCNA2). The figure contrasts the response profile of two yeast cultures,a native culture (Culture 1) and a culture in which CNA1 and CNA2 havebeen deleted (Culture 2). The horizontal axis is the log₁₀ of theintensity of the individual hybridized spots on the microarrary obtainedfrom the two yeast cultures, and therefore represents mRNA speciesabundance. The vertical axis is the log₁₀ of the ratio of the intensitymeasured for one fluorescent label (Culture 1) to that measured for theother label (Culture 2) (expression ratio). True signature genes of aCNA1/CNA2 mutation are identified as those deviating significantly fromthe log₁₀ (expression ratio)=0 line and are labeled.

FIG. 3 depicts the intensity-dependent bias that occurs in cellexpression profile experiments due to variance in fluorophore opticaldetection efficiencies as well as variance in fluorophore incorporationefficiencies.

FIG. 4A is a color ratio vs. intensity plot for an experiment in whichboth cultures were the same background strain of the yeast S.Cerevisiae. Genes with a distinct bias between a red and greenfluorophore are flagged. FIG. 4B is the same experiment as depicted inFIG. 4A except that usage of the red and green fluorophores is reversed.FIG. 4C depicts the bias removal process of the invention, wherein FIG.4A and FIG. 4B are combined to produce a response profile free offluorophore-specific biases.

FIG. 5. compares two identical response profiles that were performedunder identical experimental conditions. The figure shows thatexperimental errors decrease as a function of intensity (expressionlevel). Intensity independent contour lines illustrate a component ofthe error correction methods of the present invention.

FIG. 6 a shows a typical signature plot for a single experiment with thedrug Cyclosporin A. FIG. 6 b shows the results of applying a weightedaverage according to the methods of the present invention to fourrepeats of the experiment depicted in FIG. 6 a.

FIG. 7 illustrates a computer system useful for embodiments of theinvention.

5 DETAILED DESCRIPTION OF THE INVENTION

5.1. Introduction and General Definitions

Perturbation: As used herein, a perturbation is the experimental orenvironmental condition(s) associated with a biological system.Perturbations may be achieved by exposure of a biological system to adrug candidate or pharmacologic agent, the introduction of an exogenousgene into a biological system, the deletion of a gene from thebiological system, changes in the culture conditions of the biologicalsystem, or any other art recognized method of perturbing a biologicalsystem. Further, perturbation of a biological system may be achieved bythe onset of disease in the biological system.

Genetic Matter: As used herein, the term “genetic matter” refers tonucleic acids such as messenger RNA (“mRNA”), complementary DNA(“cDNA”), genomic DNA (“gDNA”), DNA, RNA, genes, oligonucleotides, genefragments, and any combination thereof.

Fluorophore-labeled genetic matter: As used herein, the term“fluorophore-labeled genetic matter” refers to genetic matter that hasbeen labeled with a fluorescently-labeled probe (“fluorophore”).Fluorophores include, but are not limited to, fluorescein, lissamine,phycoerythrin, rhodamine (Perkin Elmer Cetus), Cy2, Cy3, Cy3.5, Cy5,Cy5.5, Cy7, Fluor X (Amersham) and others (see, e.g., Kricka, 1992,Nonisotopic DNA Probe Techniques, Academic Press San Diego, Calif.).This DNA may be prepared by reverse transcription of mRNA or by(PCR/IVT) or (IVT) with use of fluorophores as those skilled in the artwill appreciate. See e.g. Gelder et al., 1990, “Amplified RNAsynthesized from limited quantities of heterogenous cDNA, Proc. Natl.Acad. Sci., USA, 87:1663-1667). As used herein, the term PCR refers tothe Polymerase Chain Reaction.

Biological System: As used herein, the term “biological system” isbroadly defined to include any cell, tissue, organ or multicellularorganism. For example, a biological system can be a cell line, a cellculture, a tissue sample obtained from a subject, a Homo sapien, amammal, a yeast substantially isogenic to Saccharomyces cerevisia, orany other art recognized biological system. The state of a biologicalsystem can be measured by the content, activities or structures of itscellular constituents. The state of a biological system, as used herein,is determined by the state of a collection of cellular constituents,which are sufficient to characterize the cell or organism for anintended purpose including characterizing the effects of a drug or otherperturbation. The term “cellular constituent” encompasses any kind ofmeasurable biological variable. The measurements and/or observationsmade on the state of these constituents can be of their abundances(i.e., amounts or concentrations in a biological system), theiractivities; their states of modification (e.g., phosphorylation), orother art recognized measurements relevant to the physiological state ofa biological system. In various embodiments, this invention includesmaking such measurements and/or observations on different collections ofcellular constituents. These different collections of cellularconstituents are also called aspects of the biological state of abiological system.

One aspect of the biological state of a biological system (e.g., a cellor cell culture) usefully measured in the present invention is itstranscriptional state. The transcriptional state of a biological systemincludes the identities and abundances of the constituent RNA species,especially mRNAs, in the cell under a given set of conditions. Often, asubstantial fraction of all constituent RNA species in the biologicalsystem are measured, but at least a sufficient fraction is measured tocharacterize the action of a drug or other perturbation of interest. Thetranscriptional state of a biological system can be convenientlydetermined by measuring cDNA abundances by any of several existing geneexpression technologies. DNA arrays for measuring mRNA or transcriptlevel of a large number of genes can be employed to ascertain thebiological state of a system.

Another aspect of the biological state of a biological system usefullymeasured is its translational state. The translational state of abiological system includes the identities and abundances of theconstituent protein species in the biological system under a given setof conditions. Preferably a substantial fraction of all constituentprotein species in the biological system is measured, but at least asufficient fraction is measured to characterize the action of a drug ofinterest. The transcriptional state is often representative of thetranslational state.

Other aspects of the biological state of a biological system are also ofuse in this invention. For example, the activity state of a biologicalsystem includes the activities of the constituent protein species (andalso optionally catalytically active nucleic acid species) in thebiological system under a given set of conditions. As is known to thoseof skill in the art, the translational state is often representative ofthe activity state.

This invention is also adaptable, where relevant, to “mixed” aspects ofthe biological state of a biological system in which measurements ofdifferent aspects of the biological state of a biological system arecombined. For example, in one mixed aspect, the abundances of certainRNA species and of certain protein species, are combined withmeasurements of the activities of certain other protein species.Further, it will be appreciated from the following that this inventionis also adaptable to any other aspect of a biological state of abiological system that is measurable.

The biological state of a biological system (e.g., a cell or cellculture) can be represented by a profile of some number of cellularconstituents. Such a profile of cellular constituents can be representedby the vector S.S=[S₁, . . . S_(i), . . . S_(k)]Where S_(i) is the level of the i'th cellular constituent, for example,the transcript level of gene i, or alternatively, the abundance oractivity level of protein i.

Quantitative Measurement of Cellular Constituents: MicroarraysDetermining the relative abundance of diverse individual sequences incomplex DNA samples is often accomplished using microarrays. See e.g.Shalon et al., 1996, “A Microarray System for Analyzing Complex SamplesUsing Two-color Fluorescent Probe Hybridization, Genome Research6:639-645). Frequently, transcript arrays are produced by hybridizingdetectably labeled polynucleotides representing the mRNA transcriptspresent in a cell (e.g., fluorescently labeled cDNA synthesized fromtotal cell mRNA) to a microarray. A microarray is a surface with anordered array of binding (e.g., hybridization) sites for products ofmany of the genes in the genome of a cell or organism, preferably mostor almost all of the genes. Microarrays are highly reproducible andtherefore multiple copies of a given array can be produced and thenominal copies can be compared with each other. Preferably microarraysare small, usually smaller than 5 cm², and made from materials that arestable under binding (e.g., nucleic acid hybridization) conditions. Agiven binding site or unique set of binding sites in the microarray willspecifically bind the product of a single gene in the cell.

When cDNA complementary to the RNA of a cell is made and hybridized to amicroarray under suitable hybridization conditions, the level ofhybridization to the site in the array corresponding to any particulargene will reflect the prevalence in the cell of mRNA transcribed fromthat gene. For example, when detectably labeled (e.g., with afluorophore) cDNA complementary to the total cellular mRNA is hybridizedto a microarray, the site on the array corresponding to a gene (i.e.,capable of specifically binding the product of the gene) that is nottranscribed in the cell will have little or no signal (e.g., fluorescentsignal), and a gene for which the encoded mRNA is prevalent will have arelatively strong signal.

Microarrays are advantageous because nucleic acids representing twodifferent pools of nucleic acid can be hybridized to a microarray andthe relative signal from each pool can simultaneously be measured. Eachof pool of nucleic acids may represent the state of a biological systembefore and after a perturbation. For example, a first nucleic acid poolmay be derived from a mRNA pool from a cell culture before exposing thecell culture to a pharmacological agent and a second cDNA pool may bederived from a mRNA pool derived from the same culture after exposingthe culture to a pharmacological agent. Alternatively, the two pools ofcDNA could represent pathway responses. Thus, a first DNA library couldbe derived from the mRNA of a first aliquot (“pool”) of a cell culturethat has been exposed to a pathway perturbation and a second cDNAlibrary can be derived from the mRNA of a second aliquot (“pool”) of thesame cell culture wherein the second aliquot was not exposed to thepathway perturbation. As used herein, microarray experiments, includingthose described in this section, are referred to as (“differentialmicroarray experiments”). One skilled in the art will appreciate thatmany forms of differential microarray experiments other than the onesoutlined in this disclosure are within the scope of the definition of“differential microarray experiments”. Further, as used herein, the term“differential intensity measurement” refers to measurements made indifferential microarray experiments. For example, a differentialintensity measurement could be the difference between the brightness ofa position on a microarray, which corresponds to a cellular constituentof interest, after (i) the microarray has been contacted with DNAderived from a biological system that represents a baseline state and(ii) the microarray has been contacted with DNA derived from abiological system that represents a perturbed state. Further, oneskilled in the art will appreciate that the baseline state of abiological system may represent the wild-type state of the biologicalsystem. Alternatively, the baseline state of a biological system couldrepresent a different perturbed state of the biological system. Eachmicroarray experiment in a differential microarray experiment, orrepeated differential microarray experiment preferably utilizes the sameor similar microarray. Microarrays are considered similar if they areprepared from substantially isogenic biological systems and a majorityof the binding spots on each microarray are common. Thus, the microarrayused in repeated microarray experiments may be the same identicalmicroarray, wherein the microarray is washed between microarrayexperiments, or the microarray(s) used in repeated microarrayexperiments may be exact replicas of each other, or they may similar toeach other.

Regardless of the source of the two cDNA pools in differentialmicroarray experiments, each cDNA pool is distinctively labeled with adifferent dye if the two-fluorophore microarray format is chosen. Oneskilled in the art will appreciate that certain aspects of the presentinvention are not limited to the two-fluorophore format. Typically, eachcDNA pool is labeled by deriving fluorescently-labeled cDNA by reversetranscription of polyA ⁺RNA in the presence of Cy3-(green) or Cy5-(red)deoxynucleotide triphosphates (Amersham). When the two cDNAs pools aremixed and hybridized to the microarray, the relative intensity of signalfrom each cDNA set is determined for each site on the array, and anyrelative difference in abundance of a particular mRNA detected.

When two different fluorescently labeled probes are used, such as CY3and CY5, the fluorescence emissions at each site of a microarray can bedetermined using scanning confocal laser microscopy. In one embodiment,a separate scan, using the appropriate excitation line, is carried outfor each of the two fluorophores used. Alternatively, a laser may beused that allows simultaneous specimen illumination at wavelengthsspecific to the two fluorophores and emissions from the two fluorophorescan be analyzed simultaneously (See e.g. Shalon et al., supra). Themicroarrays may be scanned with a laser fluorescent scanner with acomputer controlled X-Y stage and a microscope objective. Sequentialexcitation of the two fluorophores is achieved with a multi-line, mixedgas laser and the emitted light is split by wavelength and detected withtwo photomultiplier tubes. Fluorescence laser scanning devices aredescribed in Schena et al., 1996, Genome Res. 6:639-645 and inreferences cited herein. Alternatively, the fiber-optic bundle describedby Ferguson et al., 1996, Nature Biotech. 14:1681-1684, may be used tomonitor mRNA abundance levels at a large number of sites simultaneously.

Signals may be recorded and analyzed by computer, e.g., using a 12 bitanalog to digital board. The scanned image may be despeckled using agraphics program (e.g., Hijaak Graphics Suite) and then analyzed usingan image gridding program that creates a spreadsheet of the averagehybridization at each wavelength at each site. If necessary, anexperimentally determined correction for “cross talk” (or overlap)between the channels for the two fluorophores may be made.

As used herein, the term “microarray experiments” refers to the generalclass of experiments that are described in this section. One skilled inthe art will appreciate that microarray experiments may include the useof a single fluorophore rather than the two-fluorophore exampledescribed infra. Further, microarray experiments may be paired. Ifpaired, the first microarray experiment in the pair could represent anominal biological system representing a baseline state. The secondmicroarray experiment in the pair could represent the nominal biologicalsystem after it has been subjected to a perturbation. Thus comparison ofthe paired microarray experiment would reveal changes in the state ofthe nominal biological system based upon the perturbation. Generally, asdiscussed, supra, these pairs of microarray experiments are referred toas “differential microarray experiments”.

Cell Expression Profiles An advantage of using two different cDNA poolsin microarray experiments is that a direct and internally controlledcomparison of the mRNA levels corresponding to each arrayed gene in twocell states can be made. This and related techniques for quantitativemeasurement of cellular constitutents is generally referred to as cellconstituent profiling. Cell constituent profiling is typically expressedas changes, either in absolute level or the ratio of levels, between twoknown cell conditions, such as a response to treatment of a baselinestate with a pharmacological agent, as described in the previoussection.

Using the experimental procedures outlined in the preceding section, aratio of the emission of the two fluorophores may be calculated for anyparticular hybridization site on a DNA transcript array. This ratio isindependent of the absolute expression level of the cognate gene, but isuseful for genes whose expression is significantly modulated by drugadministration, gene deletion, or any other perturbation. As illustratedin FIGS. 2-6, two-fluorophore cell expression profiles are typicallyplotted on an x-y graph. The horizontal axis represents the log₁₀ of theratio of the mean intensity (which approximately reflects the level ofexpression of a corresponding mRNA derived from a gene) between thefirst and second pool of cDNA for each site on the microarray. Thevertical axis represents the log₁₀ of the ratio of the intensitymeasured for one fluorescent label, corresponding to the first pool ofcDNA, to that measured for the other fluorescent label, corresponding tothe second pool of cDNA, for each hybridization site on the microarray.

5.2 Fluorophore Bias Removal

As detailed in the background section the two-color fluorescenthybridization process put forth by Shalon et al., supra, introduces biasinto the profile analysis because each species of mRNA that is labeledwith fluorophore has a bias in its measured color ratio due tointeraction of the fluorescent labeling molecule (fluorophore) witheither the reverse transcription of the mRNA or with the hybridizationefficiency or both. This bias can be illustrated using the followingequations. If we represent the actual molecular abundance of aparticular species of mRNA k, representing cellular constituent or genek in the biological system of interest, as a(k), the color ratio forprobe k, ignoring any source of fluorophore bias may be represented as:r _(X/Y) =a ₁(k)/a ₂(k)  (1)where

-   -   the subscripts 1 and 2 refer to two independently extracted mRNA        cultures in which abundances are being compared;    -   a₁(k) is the abundance of species k in mRNA culture 1;    -   a₂(k) is the abundance of species k in mRNA culture 2;    -   subscripts X and Y represent the two different fluorescent        labels used; and    -   r_(X/Y) is the color ratio that ideally reflects abundance ratio        a₁/a₂.

Equation (1) ideally represents the measurement plotted on the verticalaxis of FIGS. 2 thru 6. However the use of a fluorophore labeleddeoxynucleotide triphosphates affects the efficiency by which mRNA isreverse transcribed into cDNA and affects the efficiency to which theflourophore-labeled cDNA hybridizes to the microarray. The preciseamount a specific fluorophore affects the transcription or hybridizationefficiency is highly dependent upon the precise molecular structure ofthe fluorophore used. Thus, a direct comparison of a₁(k) to a₂(k), whena₁(k) and a₂(k) are determined using different fluorophores, does notaccount for these fluorophore-specific affects on transcription andhybridization efficiency. The efficiency of a scanner at determining theabundances a₁(k) and a₂(k) on a microarray is also fluorophore specific.If we represent the combined efficiencies of particular fluorophore inextraction, labeling, reverse transcription, hybridization, and opticalscanning as E, a more realistic representation of the color ratiopresented in Equation 1 is:r _(X/Y) =a ₁(k)E _(X)(k)/a ₂(k)E _(Y)(k)  (2)where

-   -   r_(X/Y) is color ratio;    -   the subscripts 1 and 2 are as defined for equation 1;    -   a₁(k) and a₂(k) are as defined for equation 1;    -   subscripts X and Y are two fluorescent labels;    -   E_(X)(k) is the efficiency of flourescent label X; and    -   E_(Y)(k) is the efficiency of flourescent label Y.

In equation 2, Culture 1 has been analyzed using fluorophore X whereasCulture 2 has been analyzed using fluorophore Y. Now the color ratio ris related to the desired abundance ratio a₁/a₂ but includes a factordue to the fluorophore specific efficiency biases. If a secondhybridization experiment is performed, wherein Culture 1 is now analyzedwith fluorophore Y and Culture 2 is analyzed using fluorophore X, thecolor ratio in the second hybridization experiment may be representedas:r _(X/Y) ^((rev)) =a ₂(k)E _(X)(k)/a ₁(k)E _(Y)(k)  (3)where

-   -   r_(X/Y) ^((rev)) is color ratio in the reverse experiment; and    -   a₂(k), a₁(k), E_(X)(k), and E_(Y)(k) are as described for        equation (2).

Performing hybridization experiments in pairs, with the label assignmentreversed in one member of the pair, allows for creation of a combinedaverage measurement in which the fluorophore specific bias is sharplyreduced. For example a pair of two-flourophore hybridization experimentsmay be performed. The first two-fluorophore experiment would beperformed in accordance with equation (2) and the second two-fluorophorehybridization experiments would be performed according to equation (3).If the log of the ratio of the two experiments is taken, the combinedexperiment can be expressed as: $\begin{matrix}\begin{matrix}{{\left( {1/2} \right)\left( {{\log\left( r_{X/Y} \right)} - {\log\left( r_{X/Y}^{({rev})} \right)}} \right)} = {{\log\left( {{a_{1}(k)}/{a_{2}(k)}} \right)} +}} \\{\left( {{\log\left( {{E_{X}(k)}/{E_{Y}(k)}} \right)} -} \right.} \\{\log\left( {{E_{X}(k)}/{E_{Y}(k)}} \right)} \\{= {\log\left( {{a_{1}(k)}/{a_{2}(k)}} \right)}}\end{matrix} & (4)\end{matrix}$which is the desired log abundance ratio. Cancellation of the bias termslog(E_(X)(k)/E_(Y)(k)) and log(E_(X)(k)/E_(Y)(k)) relies on constancy ofthe biases between the first and second hybridization experiments ineach fluorophore-reversed pair. Equation (4) can be written equivalentlyusing ratios as found in equations (1)-(3) instead of differences of logratios. However, changes in constituent levels are most appropriatelyexpressed as the logarithm of the ratio of abundance in the pair ofconditions forming the differential measurement. This is because foldchanges are more meaningful than changes in absolute level,biologically.

This method of bias removal is particularly useful in two-colorhybridization experiments. FIG. 4 illustrates the bias removal method ofthe present invention. FIG. 4 a is a color ratio vs. intensity plot fora two-color hybridization experiment in which the two cultures used arenominally the same background strain of the yeast S. Cerevisiae. Becausethe two cultures are nominally the same, it is expected that individualspots on the microarray would flouresce with the same amount ofintensity for both of the fluorophores used. Experimental methods aredescribed in the experimental section infra. However, as is readilyapparent from FIG. 4 a, some of the spots on the microarray exhibitfluorophore-specific intensity. For example, spots on the microarray,corresponding to various genes in the yeast S. Cerevisiae, in which theintensity of the ‘red’ fluorophore is factor of 2 or more greater thanthe corresponding ‘green’ intensity are flagged because of their strongflourophore-specific bias. FIG. 4 b shows the result of thefluorophore-reversed version of the experiment plotted in FIG. 4 a. Theflagged genes in FIG. 4 b now have opposite bias. FIG. 4 c shows theresult of combining the data of FIGS. 4 a and 4 b according to themethods of the present invention described above. The biases of theflagged genes have been greatly reduced.

The procedure for bias removal as described above may be applied inother contexts. For example, if cultures must be grown at certainpositions in an incubator, and harvested in a certain order, thepositions and order for two culture types may be reversed in asubsequent experiment and the results combined as described to reducesubtle biases due to temperature or latency differences.

5.3 Combination of Multiple Experiments Using Rank-Based Methods

The prior art does not provide a clear method for optimally combiningthe results of multiple microarray experiments. The results of severalexperiments could be averaged. However, averaging does not provideinformation on the statistical significance of any given measurement foreach specific gene of interest in the microarray experiments. Thissection develops a sophisticated method for determining whether thestatistical significance of the up- or down-regulation measured forparticular genes of interest in multiple microarray experiments. Thesemethods could be applied to nominal repeats of a two-fluorophore DNAmicroarray experiment. Alternatively, these methods could be applied toone or more repeats of pairs of experiments, in which the firstexperiment in the pair represents a baseline state and the second memberof the paired repeats represents a biological state after a perturbationhas been applied.

If a gene of interest is present in the top 5% of up regulations in afirst and second nominal repeat of a microarray experiment, the chancethat it appeared that up regulated by chance in both arrays is only0.05*0.05=0.0025 or 0.25%, assuming systematic biases have been removed.Thus repeating the measurement allows a much higher level of confidencein declaring that the gene of interest is up regulated. In general, ifexpression ratios in any number of repeated experiments are expressed aspercentile rankings, the chance P(H₀) that any (pre-specified) gene ofinterest is not actually tip regulated is $\begin{matrix}{{P\left( H_{0}^{+} \right)} = {\prod\limits_{i}P_{i}}} & (5)\end{matrix}$where P_(i) is the percentile rank in the i'th experiment, expressed asa fraction (fifth percentile=0.05). The probability that the gene is notdown-regulated is given by $\begin{matrix}{{P\left( H_{0}^{-} \right)} = {\prod\limits_{i}\left( {1 - P_{i}} \right)}} & (6)\end{matrix}$

These rank-based methods provide a powerful way of reducing false alarmswith repeated measurements. For example, setting a threshold at theupper 5% of expression ratios in a hybridization to probes covering theyeast genome, which has approximately 6000 genes, would yield−6000*−0.05=300 false detections in a single experiment, but less thanone false detection on average if the same 5% threshold were appliedacross four experiment repeats (6000*(0.05)⁴). This rank combining hasthe advantage that it does not require any modeling of the detailederror behavior in the underlying hybridization experiments, other thanthe assumption of no systematic biases. The rank based method is anexample of a non-parametric statistical test for the significance ofobserved up- or down-regulations.

Percentile rankings such as equations (5) and (6) are based upon theassumption that the underlying error behavior is similar for all genes.This is not necessarily the case. For example, in FIG. 5, which plotsthe expression ratio of two nominative repeats of the same experiment,the weakly expressing genes, as expressed by log₁₀(intensity), have alog₁₀(expression ratio) that deviate from the ideal value of zero.Further, as exhibited by FIG. 5, the weaker expressing a particular geneis, the higher the tendency of the log₁₀(expression ratio) of the genefrom two nominal repeats of an experiment to deviate from zero. Thus,the low-abundance (weakly expressing and hence low-intensityhybridization) genes will tend to occupy the tails of the distributionof expression ratios (i.e. deviate from zero in accordance with FIG. 5)more often than the higher-abundance genes.

To account for the intensity-dependent error exhibited in hybridizationexperiments such as the one illustrated by FIG. 5, a measure of up- anddown-regulation that makes the error level independent of intensity canbe devised. This intensity-independent error level is derived by takingadvantage of a statistic that is capable of characterizing the errorenvelope exhibited in hybridization experiments. This error envelope isillustrated in FIG. 5 by contour lines. The many sources of error thatunderlie the experiments used to generate plots such as shown in FIG. 5generally fall into two categories—additive and multiplicative.Therefore the following statistical representation $\begin{matrix}{d = \frac{\left( {X - Y} \right)}{\sqrt{\sigma_{X}^{2} + \sigma_{Y}^{2} + {f^{2}\left( {X^{2} + Y^{2}} \right)}}}} & (7)\end{matrix}$where X and Y are the brightness for a probe spot on the microarray withrespect to the X and Y fluorophores), σ_(X) ² is a variance term for Xand represents the additive error level in the X channel, σ_(Y) ² is avariance term for Y and represents the additive error level in the Ychannel, and f is the fractional multiplicative error level, provides aparticularly well suited model for fitting the resultant error.Alternatively, X and Y are the brightness of a probe spot correspondingto a cellular constituent of interest derived from a pair ofsingle-fluorophore experiments. In one such embodiment, the firstfluorophore (X) may optionally represent a biological system in a baseline state whereas the second fluorophore (Y) may represent thebiological system in a perturbed state. Regardless of whether a singlefluorophore or a dual-fluorophore embodiment is chosen, the fractionalmultiplicative error, f, is empirically derived by fitting thedenominator of equation 7 to the measured data. The denominator ofEquation (7) is the expected standard error of the numerator, so d hasunit variance. d is therefore an error distribution statistic that isindependent of intensity, and therefore applicable to rank methods. Anyother definition with the non-parametric properties of equation (7) isalso a good variable to use in the rank methods.

According to the methods of the present invention, the denominator ofequation (7) is used to generate the intensity independent contour linesshown in FIG. 5. Thus, for example, in FIG. 5, the contour lines griddedat ±1 standard deviation have been chosen. Therefore, each contour lineabove or below zero on the vertical axis (log(Expression Ratio)=0)represents an incremental standard deviation of error in accordance withthe denominator of equation (7). The choice of using grid lines of ±1standard deviation according to the denominator of equation (7) iscompletely arbitrary. The contour lines could be gridded at anyconvenient value such as 0.25σ, 0.5σ, 2σ as long as the contour linesare plotted in accordance with the denominator of equation (7) or asimilar nonparamatric representation of error.

From FIG. 5 it is evident that the contour lines follow the errorenvelope. The value of d is proportional to the number of contours thata particular measurement falls away from log(Expression Ratio)=0. Thusthe errors are distributed with respect to the contours similarly at lowand at high intensity, and d has the desired property. One advantage ofplotting contour lines is that the amount of error associated with eachcellular constituent measured on the microarray can be calculated basedon information derived from the variance of all the cellularconstituents on the microarray across a plurality of measurements. Thus,by using grid lines as plotted in FIG. 5, the significance of anydeviation between X_(i) and Y_(i), in a two-color fluorescent probehybridization experiment, where i is a particular cellular constituent,will be placed in the context of the entire error envelope using anequation such as the denominator of equation 7. This provides anintensity independent method for determining the reliability ofmeasurement made of particular cellular constituents in microarrayexperiments including two-fluorophore or single-fluorophore experiments.

In addition to depending on intensity, error levels also may begene-specific, again violating the assumption underlying Equations (5)and (6). In this case we may define for any gene, in analogy to Equation(7), $\begin{matrix}{d = \frac{\left( {X - Y} \right)}{\sigma_{X - Y}}} & (7.1)\end{matrix}$where σ_(X-Y) is the standard error (rms uncertainty) associated withthat gene. This uncertainty may be derived from repeated controlexperiments where X and Y are derived from the same biological system,in which case σ_(X-Y) is the observed standard deviation of X-Y for thatgene over the set of experiments. This definition of d then is similarlydistributed for all genes, and (5) and (6) may be used with ranking d.5.4 Combination of Multiple Experiments Using Weighted Average Protocols

Repeated measurements may be combined to yield a quantitative expressionlevel or expression ratio with smaller error bars than individualmeasurements. Weighting the averaging procedure according to theindividual experimental error levels requires knowing or assumingsomething about the error behavior for each measured quantity. Ingeneral, an unbiased weighted mean with minimum variance is achieved bythe formula $\begin{matrix}{x = \frac{\sum\left( {x_{i}/\sigma_{i}^{2}} \right)}{\sum\left( {1/\sigma_{i}^{2}} \right)}} & (8)\end{matrix}$where x is the weighted mean of the cellular constituent being measured,x_(i), and each σ_(i) ² is the variance of an individual x_(i). See, forexample, equation 5-6 in “Data Reduction and Error Analysis for thePhysical Sciences”, 1969, Bevington, McGraw-Hill, New York, which isincorporated by reference herein in its entirety.

Each σ_(i) ² in equation (8) may be determined in a variety of ways. Oneapproach is to calculate the error envelope for a microarray experimentusing two nominal repeats of the two-fluorophore microarray experimentin which the only difference between the two experiments is that the twofluorophores utilized are reversed. See e.g. FIG. 4. Alternatively, onlyone fluorophore could be utilized. Therefore, there could be nodifference at all in the two nominal repeats that are paired in order todetermine an error envelope. Such a paired experiment is illustrated inFIG. 5. FIG. 5 also illustrates intensity independent contour lines thatare fitted in accordance with the denominator of equation (7). Todetermine individual σ_(i) ², for each individual measurement i, theintensity (x_(i)) is plotted on the appropriate reference plot, such asFIG. 5. For example, in FIG. 5, the intensity of individual measurementswould be plotted along the horizontal axis. Once the horizontal positionis determined, σ_(i) ² is calculated based upon the width of the ±1σintensity independent contour lines at position x_(i) on the referenceplot.

A general formula for the uncertainty of the mean is $\begin{matrix}{\sigma_{x}^{2} = \frac{1}{\sum\left( \frac{1}{\sigma_{i}^{2}} \right)}} & (9)\end{matrix}$in accordance with formula 5-10 of “Data Reduction and Error Analysisfor the Physical Sciences”, supra. Note that when the errors associatedwith the different nominally repeated measurements are equal, the errorin the mean is N^(−1/2) times the individual errors.

In practice the individual errors, σ_(i) ², are themselves uncertain.Inspection of control experiments such as FIG. 5 indicates the roughdistribution of errors, but do not indicate whether individual genes ata particular intensity tend to have larger errors due to peculiaritiesof their RNA extraction or even biological function in the cell. Thus abetter estimate of the error in the weighted mean is obtained by addinga component to equation (9) that accounts for scatter in the repeatedmeasurements. If we denote the observed standard deviation for gene j ass_(j), the error in the mean may be described as: $\begin{matrix}{\sigma_{x} = {\frac{1}{N}\left\lbrack {\left( \sqrt{\frac{1}{\sum\limits_{i}\frac{1}{\sigma_{i}^{2}}}} \right) + {\left( {N - 1} \right)*s_{j}}} \right\rbrack}} & (10)\end{matrix}$where N is the number of repeated measurements. Equation (10)transitions from Equation (9) to the value of the observed scatter,s_(j), as the number of repeats, N, becomes large. Note that s_(j) iscalculated according to traditional statistical methods, such that$\begin{matrix}{s_{j} \cong {\frac{1}{N - 1}{\sum\limits_{i}\left( {x_{i} - \overset{\_}{x}} \right)^{2}}}} & (11)\end{matrix}$where N is the number of measurements, x_(i) are individual measurementsof the intensity of gene j in a particular microarray experiment and{overscore (X)} is the sample mean of the individual measurements. Seee.g. equation 2-10 in “Data Reduction and Error Analysis for thePhysical Sciences”, supra, where s_(j)=σ². An estimate of the error ofthe mean, x, as described by equation (10) is necessary because,equations such as (11) require a large number of nominal repeats (N) inorder to be a true reflection of error. Estimates of error based onequation (9) do not take into consideration the errors that particularmeasurement are susceptible to as illustrated in FIG. 1 and as well asgene specific anomalies. One skilled in the art will note that otherequations that accomplish the transition from equation (9) to equation(10) are possible.

FIG. 6 illustrates the reduction in error obtained with repeatedexperiments, and the consequent gain in information. FIG. 6 a is thesignature plot for a single experiment with the drug CsA, obtained asdescribed in the experimental section infra. One sigma error bars havebeen assigned based on the denominator of equation (7), with values forthe additive and multiplicative error levels taken from controlexperiments. Genes are flagged with their 1-sigma error bars only ifthey are more than 1.5 sigma from the line log(ratio)=0, i.e. only ifthey are up- or down-regulated with confidence greater than 95%. FIG. 6b show the results of forming a weighted mean of four repeats (N=4).Here the same criterion of 1.5 sigma has been applied for flagging errorbars, but many more genes are flagged. Comparison with FIG. 6 aindicates that the number of detections at the 95% confidence hasincreased from 4 to more than 200 genes. Thus, the example illustratesthe additional information about drug response that can be obtained withrepeated measurements provided that measurement error is appropriatelymodeled using equations such as (10).

5.5 Response Profiles

The responses of a biological system to a perturbation, such apharmacological agent, can be measured by observing the changes in thebiological state of the biological system. A response profile is acollection of changes of cellular constituents. The response profile ofa biological system (e.g., a cell or cell culture) to the perturbation mmay be defined as the vector ν^((m)):ν^((m))=[ν₁ ^((m)), . . . ν_(i) ^((m)), . . . ν_(k) ^((m))]  (12)where ν_(i) ^(m) is the amplitude of response of cellular constituent iunder the perturbation m. In some embodiments of response profiles,biological response to the application of a pharmacological agent ismeasured by the induced change in the transcript level of at least 2genes, preferably more than 10 genes, more preferably more than 100genes and most preferably more than 1,000 genes.

In some embodiments, biological response profiles comprise simply thedifference between biological variables before and after perturbation.In some preferred embodiments, the biological response is defined as theratio of cellular constituents before and after a perturbation isapplied.

In some preferred embodiments, ν_(i) ^(m) is set to zero if the responseof gene i is below some threshold amplitude or confidence leveldetermined from knowledge of the measurement error behavior. In suchembodiments, those cellular constituents whose measured responses arelower than the threshold are given the response value of zero, whereasthose cellular constituents whose measured responses are greater thanthe threshold retain their measured response values. This truncation ofthe response vector is suitable when most of the smaller responses areexpected to be greatly dominated by measurement error. After thetruncation, the response vector v^((m)) also approximates a ‘matcheddetector’ (see, e.g., Van Trees, 1968, Detection, Estimation, andModulation Theory Vol. I, Wiley & Sons) for the existence of similarperturbations. It is apparent to those skilled in the art that thetruncation levels can be set based upon the purpose of detection and themeasurement errors. For example, in some embodiments, genes whosetranscript level changes are lower than two fold or more preferably fourfold are given the value of zero.

In some preferred embodiments of response profiles, perturbations areapplied at several levels of strength. For example, different amounts ofa drug may be applied to a biological system to observe its response. Insuch embodiments, the perturbation responses may be interpolated byapproximating each by a single parameterized “model” function of theperturbation strength u. An exemplary model function appropriate forapproximating transcriptional state data is the Hill function, which hasadjustable parameters α, μ₀, and n. $\begin{matrix}{{H(u)} = \frac{{a\left( {u/u_{0}} \right)}^{n}}{1 + \left( {u/u_{0}} \right)^{n}}} & (13)\end{matrix}$

The adjustable parameters are selected independently for each cellularconstituent of the perturbation response. Preferably, the adjustableparameters are selected for each cellular constituent so that the sum ofthe squares of the differences between the model function (e.g., theHill function, Equation 13) and the corresponding experimental data ateach perturbation strength is minimized. This preferable parameteradjustment method is known in the art as a least squares fit. Otherpossible model functions are based on polynomial fitting. More detaileddescription of model fitting and biological response has been disclosedin Friend and Stoughton, Methods of Determining Protein Activity LevelsUsing Gene Expression Profiles, U.S. Provisional Application Ser. No.60/084,742, filed on May 8, 1998, which is incorporated herein byreference in it's entirety for all purposes.

5.6 Projected Profiles

The methods of the invention are useful for comparing augmented profilesthat contain any number of response profile and/or projected profiles.Projected profiles are best understood after a discussion of genesets,which are co-regulated genes. Projected profiles are useful foranalyzing many types of cellular constituents including genesets.

5.6.1 Co-Regulated Genes and Genesets

The use of genesets for representing projected profiles is described inthis and the following subsections and also detailed in U.S. patentapplication Ser. No. 09/179,569 filed Oct. 27, 1998 entitled “Methodsfor using co-regulated genesets to enhance determination andclassification of gene expression” by Friend et al., and U.S. patentapplication serial number to be assigned (Attorney docket number9301-039-999) filed Dec. 23, 1998 by Friend et al., entitled “Methodsfor using co-regulated genesets to enhance determination andclassification of gene expression” which are both incorporated herein byreference in their entireties. Certain genes tend to increase ordecrease their expression in groups. Genes tend to increase or decreasetheir rates of transcription together when they possess similarregulatory sequence patterns, ie., transcription factor binding sites.This is the mechanism for coordinated response to particular signalinginputs (see, e.g., Madhani and Fink, 1998, The riddle of MAP kinasesignaling specificity, Transactions in Genetics 14:151-155; Arnone andDavidson, 1997, The hardwiring of development: organization and functionof genomic regulatory systems, Development 124:1851-1864). Separategenes which make different components of a necessary protein or cellularstructure will tend to co-vary. Duplicated genes (see, e.g., Wagner,1996, Genetic redundancy caused by gene duplications and its evolutionin networks of transcriptional regulators, Biol. Cybern. 74:557-567)will also tend to co-vary to the extent mutations have not led tofunctional divergence in the regulatory regions. Further, becauseregulatory sequences are modular (see, e.g., Yuh et al., 1998, Genomiccis-regulatory logic: experimental and computational analysis of a seaurchin gene, Science 279:1896-1902), the more modules two genes have incommon, the greater the variety of conditions under which they areexpected to co-vary their transcriptional rates. Separation betweenmodules also is an important determinant since co-activators also areinvolved. In summary therefore, for any finite set of conditions, it isexpected that genes will not all vary independently, and that there aresimplifying subsets of genes and proteins that will co-vary. Theseco-varying sets of genes form a complete basis in the mathematical sensewith which to describe all the profile changes within that finite set ofconditions.

5.6.2 Geneset Classification by Cluster Analysis

For many applications, it is desirable to find basis genesets that areco-regulated over a wide variety of conditions. A preferred embodimentfor identifying such basis genesets involves clustering algorithms (forreviews of clustering algorithms, see, e.g., Fukunaga, 1990, StatisticalPattern Recognition, 2nd Ed., Academic Press, San Diego; Everitt, 1974,Cluster Analysis, London: Heinemann Educ. Books; Hartigan, 1975,Clustering Algorithms, New York: Wiley; Sneath and Sokal, 1973,Numerical Taxonomy, Freeman; Anderberg, 1973, Cluster Analysis forApplications, Academic Press: New York).

In some embodiments employing cluster analysis, the expression of alarge number of genes is monitored as biological systems are subjectedto a wide variety of perturbations. A table of data containing the geneexpression measurements is used for cluster analysis. In order to obtainbasis genesets that contain genes which co-vary over a wide variety ofconditions multiple perturbations or conditions are employed. Clusteranalysis operates on a table of data which has the dimension m×k whereinm is the total number of conditions or perturbations and k is the numberof genes measured.

A number of clustering algorithms are useful for clustering analysis.Clustering algorithms use dissimilarities or distances between objectswhen forming clusters. In some embodiments, the distance used isEuclidean distance in multidimensional space: $\begin{matrix}{{I\left( {x,y} \right)} = \left\{ {\sum\limits_{i}\left( {X_{i} - Y_{i}} \right)^{2}} \right\}^{1/2}} & (14)\end{matrix}$where I(x,y) is the distance between gene X and gene Y; X_(i) and Y_(i)are gene expression response under perturbation i. The Euclideandistance may be squared to place progressively greater weight on objectsthat are further apart. Alternatively, the distance measure may be theManhattan distance e.g., between gene X and Y, which is provided by:$\begin{matrix}{{I\left( {x,y} \right)} = {{\sum\limits_{i}{\text{}X_{i}}} - {Y_{i}\text{}^{2}}}} & (15)\end{matrix}$

Again, X_(i) and Y_(i) are gene expression responses under perturbationi. Some other definitions of distances are Chebychev distance, powerdistance, and percent disagreement. Percent disagreement, defined asI(x,y)=(number of X_(i)≈Y_(i))/i, is particularly useful for the methodof this invention, if the data for the dimensions are categorical innature. Another useful distance definition, which is particularly usefulin the context of cellular response, is I=1−r, where r is thecorrelation coefficient between the response vectors X, Y, also calledthe normalized dot product X·Y/|X∥Y|.

Various cluster linkage rules are useful for defining genesets. Singlelinkage, a nearest neighbor method, determines the distance between thetwo closest objects. By contrast, complete linkage methods determinedistance by the greatest distance between any two objects in thedifferent clusters. This method is particularly useful in cases whengenes or other cellular constituents form naturally distinct “clumps.”Alternatively, the unweighted pair-group average defines distance as theaverage distance between all pairs of objects in two different clusters.This method is also very useful for clustering genes or other cellularconstituents to form naturally distinct “clumps.” Finally, the weightedpair-group average method may also be used. This method is the same asthe unweighted pair-group average method except that the size of therespective clusters is used as a weight. This method is particularlyuseful for embodiments where the cluster size is suspected to be greatlyvaried (Sneath and Sokal, 1973, Numerical taxonomy. San Francisco: W. H.Freeman & Co.). Other cluster linkage rules, such as the unweighted andweighted pair-group centroid and Ward's method are also useful for someembodiments of the invention. See., e.g., Ward, 1963, J. Am. Stat Assn.58:236; Hartigan, 1975, Clustering algorithms, New York: Wiley.

As the diversity of perturbations in the clustering set becomes verylarge, the genesets which are clearly distinguishable get smaller andmore numerous. However, even over very large experiment sets, there aresmall genesets that retain their coherence. These genesets are termedirreducible genesets. Typically, a large number of diverse perturbationsare applied to obtain such irreducible genesets.

Often, the clustering of genesets is represented graphically and istermed a ‘tree’. Genesets may be defined based on the many smallerbranches of a tree, or a small number of larger branches by cuttingacross the tree at different levels. The choice of cut level may be madeto match the number of distinct response pathways expected. If little orno prior information is available about the number of pathways, then thetree should be divided into as many branches as are truly distinct.‘Truly distinct’ may be defined by a minimum distance value between theindividual branches. Typical values are in the range 0.2 to 0.4 where 0is perfect correlation and 1 is zero correlation, but may be larger forpoorer quality data or fewer experiments in the training set, or smallerin the case of better data and more experiments in the training set.

Preferably, ‘truly distinct’ may be defined with an objective test ofstatistical significance for each bifurcation in the tree. In one aspectof the invention, the Monte Carlo randomization of the experiment indexfor each cellular constituent's responses across the set of experimentsis used to define an objective test.

In some embodiments, the objective test is defined in the followingmanner:

Let p_(ki) be the response of constituent k in experiment i. Let II(i)be a random permutation of the experiment index. Then for each of alarge (about 100 to 1000) number of different random permutations,construct P_(kII(i)). For each branching in the original tree, for eachpermutation:

-   -   (1) perform hierarchical clustering with the same algorithm        (‘hclust’ in this case) used on the original unpermuted data;    -   (2) compute fractional improvement f in the total scatter with        respect to cluster centers in going from one cluster to two        clusters        f=1−ΣD _(k) ⁽¹⁾ /σD _(k) ⁽²⁾  (16)        where D_(k) is the square of the distance measure for        constituent k with respect to the center (mean) of its assigned        cluster. Superscript 1 or 2 indicates whether it is with respect        to the center of the entire branch or with respect to the center        of the appropriate cluster out of the two subclusters. There is        considerable freedom in the definition of the distance function        D used in the clustering procedure. In these examples, D=1−r,        where r is the correlation coefficient between the responses of        one constituent across the experiment set Vs. the responses of        the other (or vs. the mean cluster response).

The distribution of fractional improvements obtained from the MonteCarlo procedure is an estimate of the distribution under the nullhypothesis that a given branching was not significant. The actualfractional improvement for that branching with the unpermuted data isthen compared to the cumulative probability distribution from the nullhypothesis to assign significance. Standard deviations are derived byfitting a log normal model for the null hypothesis distribution. Usingthis procedure, a standard deviation greater than about 2, for example,indicates that the branching is significant at the 95% confidence level.Genesets defined by cluster analysis typically have underlyingbiological significance.

Another aspect of the cluster analysis method provides the definition ofbasis vectors for use in profile projection described in the followingsections.

A set of basis vectors V has k×n dimensions, where k is the number ofgenes and n is the number of genesets. $\begin{matrix}{V = \begin{bmatrix}V_{1}^{(1)} & . & V_{1}^{(n)} \\. & . & . \\V_{k}^{(1)} & . & V_{k}^{(n)}\end{bmatrix}} & (17)\end{matrix}$V^((n)) _(k) is the amplitude contribution of gene index k in basisvector n. In some embodiments, V^((n)) _(k)=1, if gene k is a member ofgeneset n, and V^((n)) _(k)=0 if gene k is not a member of geneset n. Insome embodiments, V^((n)) _(k) is proportional to the response of gene kin geneset n over the training data set used to define the genesets.

In some preferred embodiments, the elements V^((n)) _(k) are normalizedso that each V^((n)) _(k) has unit length by dividing by the square rootof the number of genes in geneset n. This produces basis vectors whichare not only orthogonal (the genesets derived from cutting theclustering tree are disjoint), but also orthonormal (unit length). Withthis choice of normalization, random measurement errors in profilesproject onto the V^((n)) _(k) in such a way that the amplitudes tend tobe comparable for each n. Normalization prevents large genesets fromdominating the results of similarity calculations.

5.6.3 Geneset Classification Based Upon Mechanisms of Regulation

Genesets can also be defined based upon the mechanism of the regulationof genes. Genes whose regulatory regions have the same transcriptionfactor binding sites are more likely to be co-regulated. In somepreferred embodiments, the regulatory regions of the genes of interestare compared using multiple alignment analysis to decipher possibleshared transcription factor binding sites (Stormo and Hartzell, 1989,Identifying protein binding sites from unaligned DNA fragments, ProcNatl Acad Sci 86:1183-1187; Hertz and Stormo, 1995, Identification ofconsensus patterns in unaligned DNA and protein sequences: alarge-deviation statistical basis for penalizing gaps, Proc of 3rd IntlConf on Bioinformatics and Genome Research, Lim and Cantor, eds., WorldScientific Publishing Co., Ltd. Singapore, pp. 201-216). For example, asExample 3, infra, shows, common promoter sequence responsive to Gcn4 in20 genes may be responsible for those 20 genes being co-regulated over awide variety of perturbations.

The co-regulation of genes is not limited to those with binding sitesfor the same transcriptional factor. Co-regulated (co-varying) genes maybe in the up-stream/down-stream relationship where the products ofup-stream genes regulate the activity of down-stream genes. It is wellknown to those of skill in the art that there are numerous varieties ofgene regulation networks. One of skill in the art also understands thatthe methods of this invention are not limited to any particular kind ofgene regulation mechanism. If it can be derived from the mechanism ofregulation that two genes are co-regulated in terms of their activitychange in response to perturbation, the two genes may be clustered intoa geneset.

Because of lack of complete understanding of the regulation of genes ofinterest, it is often preferred to combine cluster analysis withregulatory mechanism knowledge to derive better defined genesets. Insome embodiments, K-means clustering may be used to cluster genesetswhen the regulation of genes of interest is partially known. K-meansclustering is particularly useful in cases where the number of genesetsis predetermined by the understanding of the regulatory mechanism. Ingeneral, K-mean clustering is constrained to produce exactly the numberof clusters desired. Therefore, if promoter sequence comparisonindicates the measured genes should fall into three genesets, K-meansclustering may be used to generate exactly three genesets with greatestpossible distinction between clusters.

5.6.4 Representing Projected Profiles

The expression value of genes can be converted into the expression valuefor genesets. This process is referred to as projection. In someembodiments, the projection is as follows:P=[P ₁, . . . P_(i), . . . P_(n)]=p·V  (18)wherein, p is the expression profile, P is the projected profile, P_(i)is expression value for geneset i and V is a predefined set of basisvectors. The basis vectors have been previously defined in Equation 17as: $\begin{matrix}{V = \begin{bmatrix}V_{1}^{(1)} & . & V_{1}^{(n)} \\. & . & . \\V_{k}^{(1)} & . & V_{k}^{(n)}\end{bmatrix}} & (19)\end{matrix}$wherein V^((n)) _(k) is the amplitude of cellular constituent index k ofbasis vector n.

In one preferred embodiment, the value of geneset expression is simplythe average of the expression value of the genes within the geneset. Insome other embodiments, the average is weighted so that highly expressedgenes do not dominate the geneset value. The collection of theexpression values of the genesets is the projected profile.

5.6.5 Profile Comparison and Classification

Once the basis genesets are chosen, projected profiles P_(i) may beobtained for any set of profiles indexed by i. Similarities between theP_(i) may be more clearly seen than between the original profiles p_(i)for two reasons. First, measurement errors in extraneous genes have beenexcluded or averaged out. Second, the basis genesets tend to capture thebiology of the profiles p_(i) and so are matched detectors for theirindividual response components. Classification and clustering of theprofiles both are based on an objective similarity metric, call it S,where one useful definition isS _(ij) =S(P _(i) ,P _(j))=P _(i) ·P _(j)/(|P _(i) ∥P _(j)|)  (20)

This definition is the generalized angle cosine between the vectorsP_(i) and P_(j). It is the projected version of the conventionalcorrelation coefficient between p_(i) and p_(j). Profile p_(i) is deemedmost similar to that other profile p_(j) for which S_(ij) is maximum.New profiles may be classified according to their similarity to profilesof known biological significance, such as the response patterns forknown drugs or perturbations in specific biological pathways. Sets ofnew profiles may be clustered using the distance metricD _(ij)=1−S _(ij)  (21)where this clustering is analogous to clustering in the original largerspace of the entire set of response measurements, but has the advantagesjust mentioned of reduced measurement error effects and enhanced captureof the relevant biology.

The statistical significance of any observed similarity S_(ij) may beassessed using an empirical probability distribution generated under thenull hypothesis of no correlation. This distribution is generated byperforming the projection, Equations (19) and (20) for many differentrandom permutations of the constituent index in the original profile p.That is, the ordered set P_(k) are replaced by p_(II(k)) where II(k) isa permutation, for ˜100 to 1000 different random permutations. Theprobability of the similarity S_(ij) arising by chance is then thefraction of these permutations for which the similarity S_(ij)(permuted) exceeds the similarity observed using the original unpermuteddata.

5.7 Methods for Determining Biological Response Profiles

This section provides some exemplary methods for measuring biologicalresponses as well as the procedures necessary to make the reagents usedin such methods.

5.7.1 Preparation of Microarrays

Microarrays are known in the art and consist of a surface to whichprobes that correspond in sequence to gene products (e.g., cDNAs, mRNAs,cRNAs, polypeptides, and fragments thereof), can be specificallyhybridized or bound at a known position. In one embodiment, themicroarray is an array (ie., a matrix) in which each position representsa discrete binding site for a product encoded by a gene (e.g., a proteinor RNA), and in which binding sites are present for products of most oralmost all of the genes in the organism's genome. In a preferredembodiment, the “binding site” (hereinafter, “site”) is a nucleic acidor nucleic acid analogue to which a particular cognate cDNA canspecifically hybridize. The nucleic acid or analogue of the binding sitecan be, e.g., a synthetic oligomer, a full-length cDNA, a less-than fulllength cDNA, or a gene fragment.

Although in a preferred embodiment the microarray contains binding sitesfor products of all or almost all genes in the target organism's genome,such comprehensiveness is not necessarily required. Usually themicroarray will have binding sites corresponding to at least about 50%of the genes in the genome, often at least about 75%, more often atleast about 85%, even more often more than about 90%, and most often atleast about 99%. Preferably, the microarray has binding sites for genesrelevant to the action of a drug of interest or in a biological pathwayof interest. A “gene” is an open reading frame (ORF) of preferably atleast 50, 75, or 99 amino acids from which a messenger RNA istranscribed in the organism (e.g.; if a single cell) or in some cell ina multicellular organism. The number of genes in a genome can beestimated from the number of mRNAs expressed by the organism, or byextrapolation from a well-characterized portion of the genome. When thegenome of the organism of interest has been sequenced, the number ofORFs can be determined and mRNA coding regions identified by analysis ofthe DNA sequence. For example, the Saccharomyces cerevisiae genome hasbeen completely sequenced and is reported to have approximately 6275open reading frames (ORFs) longer than 99; amino acids. Analysis ofthese ORFs indicates that there are 5885 ORFs that are likely to specifyprotein products (Goffeau et al., 1996, Life with 6000 genes, Science274:546-567, which is incorporated by reference in its entirety for allpurposes). In contrast, the human genome is estimated to containapproximately 10⁵ genes.

5.7.2 Preparing Nucleic Acids for Microarrays

As noted above, the “binding site” to which a particular cognate cDNAspecifically hybridizes is usually a nucleic acid or nucleic acidanalogue attached at that binding site. In one embodiment, the bindingsites of the microarray are DNA polynucleotides corresponding to atleast a portion of each gene in an organism's genome. These DNAs can beobtained by, e.g., polymerase chain reaction (PCR) amplification of genesegments from genomic DNA, cDNA (e.g., by RT-PCR), or cloned sequences.PCR primers are chosen, based on the known sequence of the genes orcDNA, that result in amplification of unique fragments (i.e., fragmentsthat do not share more than 10 bases of contiguous identical sequencewith any other fragment on the microarray). Computer programs are usefulin the design of primers with the required specificity and optimalamplification properties. In the case of binding sites corresponding tovery long genes, it will sometimes be desirable to amplify segments nearthe 3′ end of the gene so that when oligo-dT primed cDNA probes arehybridized to the microarray, less-than-full length probes will bindefficiently. Typically each gene fragment on the microarray will bebetween about 50 bp and about 2000 bp, more typically between about 100bp and about 1000 bp, and usually between about 300 bp and about 800 bpin length. PCR methods are well known and are described, for example, inInnis et al. eds., 1990, PCR Protocols: A Guide to Methods andApplications, Academic Press Inc., San Diego, Calif., which isincorporated by reference in its entirety for all purposes. Analternative means for generating the nucleic acid for the microarray isby synthesis of synthetic polynucleotides or oligonucleotides, e.g.,using N-phosphonate or phosphoramidite chemistries (Froehler et al.,1986, Nucleic Acid Res 14:5399-5407; McBride et al., 1983, TetrahedronLett. 24:245-248). Synthetic sequences are between about 15 and about500 bases in length, more typically between about 20 and about 50 bases.In some embodiments, synthetic nucleic acids include non-natural bases,e.g., inosine. As noted above, nucleic acid analogues may be used asbinding sites for hybridization. An example of a suitable nucleic acidanalogue is peptide nucleic acid (see, e.g., Egholm et al., 1993, PNAhybridizes to complementary oligonucleotides obeying the Watson-Crickhydrogen-bonding rules, Nature 365:566-568; see also U.S. Pat. No.5,539,083).

In an alternative embodiment, the binding (hybridization) sites are madefrom plasmid or phage clones of genes, cDNAs (e.g., expressed sequencetags), or inserts therefrom (Nguyen et al., 1995, Differential geneexpression in the murine thymus assayed by quantitative hybridization ofarrayed cDNA clones, Genomics 29:207-209). In yet another embodiment,the polynucleotide of the binding sites is RNA.

5.7.3 Attaching Nucleic Acids to the Solid Surface

The nucleic acid or analogue are attached to a solid support, which maybe made from glass, plastic (e.g., polypropylene, nylon),polyacrylamide, nitrocellulose, or other materials. A preferred methodfor attaching the nucleic acids to a surface is by printing on glassplates, as is described generally by Schena et al., 1995, Quantitativemonitoring of gene expression patterns with a complementary microarray,Science 270:467-470. This method is especially useful for preparingmicroarrays of cDNA. See also DeRisi et al., 1996, Use of a cMicroarrayto analyze gene expression patterns in human cancer, Nature Genetics14:457-460; Shalon et al., 1996, A microarray system for analyzingcomplex DNA samples using two-color fluorescent probe hybridization,Genome Res. 6:639-645; and Schena et al., 1995, Parallel human genomeanalysis; microarray-based expression of 1000 genes, Proc. Natl. Acad.Sci. USA 93:10539-11286.

A second preferred method for making microarrays is by makinghigh-density oligonucleotide arrays. Techniques are known for producingarrays containing thousands of oligonucleotides complementary to definedsequences, at defined locations on a surface using photolithographictechniques for synthesis in situ (see, Fodor et al., 1991,Light-directed spatially addressable parallel chemical synthesis,Science 251:767-773; Pease et al., 1994, Light-directed oligonucleotidearrays for rapid DNA sequence analysis, Proc. Natl. Acad. Sci. USA91:5022-5026; Lockhart et al., 1996, Expression monitoring byhybridization to high-density oligonucleotide arrays, Nature Biotech14:1675; U.S. Pat. Nos. 5,578,832; 5,556,752; and 5,510,270, each ofwhich is incorporated by reference in its entirety for all purposes) orother methods for rapid synthesis and deposition of definedoligonucleotides (Blanchard et al., 1996, High-Density Oligonucleotidearrays, Biosensors & Bioelectronics 11: 687-90). When these methods areused, oligonucleotides (e.g., 20-mers) of known sequence are synthesizeddirectly on a surface such as a derivatized glass slide. Usually, thearray produced contains multiple probes against each target transcript.Oligonucleotide probes can be chosen to detect alternatively splicedmRNAs or to serve as various type of control.

Another preferred method of making microarrays is by use of an inkjetprinting process to synthesize oligonucleotides directly on a solidphase, as described, e.g., in co-pending U.S. patent application Ser.No. 09/008,120 filed on Jan. 16, 1998, by Blanchard entitled “ChemicalSynthesis Using Solvent Microdroplets”, which is incorporated byreference herein in its entirety.

Other methods for making microarrays, e.g., by masking (Maskos andSouthern, 1992, Nuc. Acids Res. 20:1679-1684), may also be used. Inprincipal, any type of array, for example, dot blots on a nylonhybridization membrane (see Sambrook et al., Molecular Cloning—ALaboratory Manual (2nd Ed.), Vol. 1-3, Cold Spring Harbor Laboratory,Cold Spring Harbor, N.Y., 1989), could be used, although, as will berecognized by those of skill in the art, very small arrays will bepreferred because hybridization volumes will be smaller.

5.7.4 Generating Labeled Probes

Methods for preparing total and poly(A)+ RNA are well known and aredescribed generally in Sambrook et al., supra. In one embodiment, RNA isextracted from cells of the various types of interest in this inventionusing guanidinium thiocyanate lysis followed by CsCl centrifugation(Chirgwin et al., 1979, Biochemistry 18:5294-5299). Poly(A)+ RNA isselected by selection with oligo-dT cellulose (see Sambrook et al.,supra). Cells of interest include wild-type cells, drug-exposedwild-type cells, modified cells, and drug-exposed modified cells.

Labeled cDNA is prepared from mRNA by oligo dT-primed or random-primedreverse transcription, both of which are well known in the art (see,e.g. Klug and Berger, 1987, Methods Enzymol. 152:316-325). Reversetranscription may be carried out in the presence of a dNTP conjugated toa detectable label, most preferably a fluorescently labeled dNTP.Alternatively, isolated mRNA can be converted to labeled antisense RNAsynthesized by in vitro transcription of double-stranded cDNA in thepresence of labeled dNTPs (Lockhart et al., 1996, Expression monitoringby hybridization to high-density oligonucleotide arrays, Nature Biotech.14:1675, which is incorporated by reference in its entirety for allpurposes). In alternative embodiments, the cDNA or RNA probe can besynthesized in the absence of detectable label and may be labeledsubsequently, e.g., by incorporating biotinylated dNTPs or rNTP, or somesimilar means (e.g., photo-cross-linking a psoralen derivative of biotinto RNAs), followed by addition of labeled streptavidin (e.g.,phycoerythrin-conjugated streptavidin) or the equivalent.

When fluorescently-labeled probes are used, many suitable fluorophoresare known, including fluorescein, lissamine, phycoerythrin, rhodamine(Perkin Elmer Cetus), Cy2, Cy3, Cy3.5, Cy5, Cy5.5, Cy7, Fluor X(Amersham) and others (see, e.g., Kricka, 1992, Nonisotopic DNA ProbeTechniques, Academic Press San Diego, Calif.). It will be appreciatedthat pairs of fluorophores are chosen that have distinct emissionspectra so that they can be easily distinguished.

In another embodiment, a label other than a fluorescent label is used.For example, a radioactive label, or a pair of radioactive labels withdistinct emission spectra, can be used (see Zhao et al, 1995, Highdensity cDNA filter analysis: a novel approach for large-scale,quantitative analysis of gene expression, Gene 156:207; Pietu et al.,1996, Novel gene transcripts preferentially expressed in human musclesrevealed by quantitative hybridization of a high density cDNA array,Genome Res. 6:492). However, because of scattering of radioactiveparticles, and the consequent requirement for widely spaced bindingsites, use of radioisotopes is a less-preferred embodiment.

In one embodiment, labeled cDNA is synthesized by incubating a mixturecontaining 0.5 mM dGTP, dATP and dCTP plus 0.1 mM dTTP plus fluorescentdeoxyribonucleotides (e.g., 0.1 mM Rhodamine 110 UTP (Perken ElmerCetus) or 0.1 mM Cy3 dUTP (Amersham)) with reverse transcriptase (e.g.,SuperScript™ II, LTI Inc.) at 42° C. for 60 minutes.

5.7.5 Hybridization to Microarrays

Nucleic acid hybridization and wash conditions are optimally chosen sothat the probe “specifically binds” or “specifically hybridizes” to aspecific array site, i.e., the probe hybridizes, duplexes or binds to asequence array site with a complementary nucleic acid sequence but doesnot hybridize to a site with a non-complementary nucleic acid sequence.One polynucleotide sequence is considered complementary to another when,if the shorter of the polynucleotides is less than or equal to 25 bases,there are no mismatches using standard base-pairing rules or, if theshorter of the polynucleotides is longer than 25 bases, there is no morethan a 5% mismatch. Preferably, the polynucleotides are perfectlycomplementary (no mismatches). It can easily be demonstrated thatspecific hybridization conditions result in specific hybridization bycarrying out a hybridization assay including negative controls (see,e.g., Shalon et al., supra, and thee et al, supra).

Optimal hybridization conditions will depend on the length (e.g.,oligomer versus polynucleotide greater than 200 bases) and type (e.g.,RNA, DNA, PNA) of labeled probe and immobilized polynucleotide oroligonucleotide. General parameters for specific (i.e., stringent)hybridization conditions for nucleic acids are described in Sambrook etal., supra, and in Ausubel et al., 1987, Current Protocols in MolecularBiology, Greene Publishing and Wiley-Interscience, New York. When themicroarrays of Schena et al. are used, typical hybridization conditionsare hybridization in 5×SSC plus 0.2% SDS at 65° C. for 4 hours followedby washes at 25° C. in low stringency wash buffer (1×SSC plus 0.2% SDS)followed by 10 minutes at 25° C. in high stringency wash buffer (0.1×SSCplus 0.2% SDS) (Shena et al., 1996, Proc. Natl. Acad. Sci. USA,93:10614). Useful hybridization conditions are also provided in, e.g.,Tijessen, 1993, Hybridization With Nucleic Acid Probes, Elsevier SciencePublishers B. V. and Kricka, 1992, Nonisotopic DNA Probe Techniques,Academic Press San Diego, Calif.

5.8 Computer Implementations

The analytic methods described in the previous sections can beimplemented by use of the following computer systems and according tothe following programs and methods. FIG. 7 illustrates an exemplarycomputer system suitable for implementation of the analytic methods ofthis invention. Computer system 501 is illustrated as comprisinginternal components and being linked to external components. Theinternal components of this computer system include processor element502 interconnected with main memory 503. For example, computer system501 can be an Intel 8086-, 80386-, 80486-, Pentium®, or Pentium®-basedprocessor with preferably 32 MB or more of main memory.

The external components include mass storage 504. This mass storage canbe one or more hard disks (which are typically packaged together withthe processor and memory). Such hard disks are preferably of 1 GB orgreater storage capacity. Other external components include userinterface device 505, which can be a monitor, together with inputingdevice 506, which can be a “mouse”, or other graphic input devices (notillustrated), and/or a keyboard. A printing device 508 can also beattached to the computer 501.

Typically, computer system 501 is also linked to network link 507, whichcan be part of an Ethernet link to other local computer systems, remotecomputer systems, or wide area communication networks, such as theInternet. This network link allows computer system 501 to share data andprocessing tasks with other computer systems.

Loaded into memory during operation of this system are several softwarecomponents, which are both standard in the art and special to theinstant invention. These software components collectively cause thecomputer system to function according to the methods of this invention.These software components are typically stored on mass storage 504.Software component 510 represents the operating system, which isresponsible for managing computer system 501 and its networkinterconnections. This operating system can be, for example, of theMicrosoft Windows' family, such as Windows 3.1, Windows 95, Windows 98,or Windows NT. Software component 511 represents common languages andfunctions conveniently present on this system to assist programsimplementing the methods specific to this invention. Many high or lowlevel computer languages can be used to program the analytic methods ofthis invention. Instructions can be interpreted during run-time orcompiled. Preferred languages include C/C++, FORTRAN and JAVA®. Mostpreferably, the methods of this invention are programmed in mathematicalsoftware packages that allow symbolic entry of equations and high-levelspecification of processing, including algorithms to be used, therebyfreeing a user of the need to procedurally program individual equationsor algorithms. Such packages include Matlab from Mathworks (Natick,Mass.), Mathematica from Wolfram Research (Champaign, Ill.), or S-Plusfrom Math Soft (Cambridge, Mass.). Accordingly, software component 512and/or 513 represents the analytic methods of this invention asprogrammed in a procedural language or symbolic package. In an exemplaryimplementation, to practice the methods of the present invention, a userfirst loads differential microarray experiment data into the computersystem 501. These data can be directly entered by the user from monitor505, keyboard 506, or from other computer systems linked by networkconnection 507, or on removable storage media such as a CD-ROM, floppydisk (not illustrated), tape drive (not illustrated), ZIP® drive (notillustrated) or through the network (507). Next the user causesexecution of expression profile analysis software 512 which performs themethods of the present invention.

In another exemplary implementation, a user first loads microarrayexperiment data into the computer system. This data is loaded into thememory from the storage media (504) or from a remote computer,preferably from a dynamic geneset database system, through the network(507). Next the user causes execution of software that performs thesteps of fluorophore bias removal, the rank-based methods of the presentinvention or the weighted averaging protocols of the present invention.

Alternative computer systems and software for implementing the analyticmethods of this invention will be apparent to one of skill in the artand are intended to be comprehended within the accompanying claims. Inparticular, the accompanying claims are intended to include thealternative program structures for implementing the methods of thisinvention that will be readily apparent to one of skill in the art.

6 EXPERIMENTAL

The following section details how reagents are prepared for theexperiments illustrated in FIGS. 2-6.

Construction, Growth and Drug-Treatment of Yeast Strains

The strains used in this study were constructed by standard techniques.See e.g. Schiestl et al., 1993, Introducing DNA into yeast bytransformation, Methods: A companion to Methods in Enzymology 5:79-85.For experiments involving FK506, cells were grown for three generationsto a density of 1×10⁷ cells/ml in YAPD medium (YPD plus 0.004% adenine)supplemented with 10 mM calcium chloride as previously described byGarrett-Engele et al., 1995, Calcineurin, the Ca²⁺/calmodulin-dependentprotein phosphatase, is essential in yeast mutants with cell integritydefects and in mutants that lack functional vacuolar H(+)-ATPase, Mol.Cell. Biol. 15:4103-4114. Where indicated, FK506 was added to a finalconcentration of 1 μg/ml 0.5 hr after inoculation of the culture.Cyclosporin A (CsA) was added to a concentration of 30 μg/ml. Cells werebroken by standard procedures (See e.g. Ausubel et al., CurrentProtocols in Molecular Biology, John Wiley & Sons, Inc. (New York),12.12.1-13.12.5) with the following modifications. Cell pellets wereresuspended in breaking buffer (0.2M Tris HCl pH 7.6, 0.5M NaCl, 10 mMEDTA, 1% SDS), vortexed for 2 minutes on a VWR multitube vortexer atsetting 8 in the presence of 60% glass beads (425-600 μm mesh; Sigma)and phenol:chloroform (50:50, v/v). Following separation, the aqueousphase was reextracted and ethanol precipitated. Poly A⁺ RNA was isolatedby two sequential chromatographic purifications over oligo dT cellulose(NEB) using established protocols. See e.g. Ausubel et al., supra).

Preparation and Hybridization of the Labeled Sample

Fluorescently-labeled cDNA Was prepared, purified and hybridizedessentially as described by DeRisi et al. DeRisi et al., 1997, Exploringthe metabolic and genetic control of gene expression on a genomic scale,Science 278:680-686. Briefly, Cy3- or Cy5-dUTP (Amersham) wasincorporated into cDNA during reverse transcription (Superscript II,LTI, Inc.) And purified by concentrating to less than 10 μl usingMicrocon-30 microconcentrators (Amicon). Paired cDNAs were resuspendedin 20-2611 hybridization solution (3×SSC, 0.75 μg/ml poly A DNA, 0.2%SDS) and applied to the microarray under a 22×30 mm coverslip for 6 hrat 63° C., all according to DeRisi et al., (1997), supra.

Fabrication and Scanning of Microarrays

PCR products containing common 5′ and 3′ sequences (Research Genetics)were used as templates with amino-modified forward primer and unmodifiedreverse primers to PCR amplify 6065 ORFs from the S. cervisiae genome.First pass success rate was 94%. Amplification reactions that gaveproducts of unexpected sizes were excluded from subsequent analysis.ORFs that could not be amplified from purchased templates were amplifiedfrom genomic DNA. DNA samples from 100 μl reactions were isopropanolprecipitated, resuspended in water, brought to 3×SSC in a total volumeof 15 μl, and transferred to 384-well microtiter plates (Genetix). PCRproducts were spotted into 1×3 inch polylysine-treated glass slides by arobot built according to specifications provided in Schena et al.,supra; DeRisi et al., 1996, Discovery and analysis of inflammatorydisease-related genes using microarrays, PNAS USA, 94:2150-2155; andDeResi et al., (1997). After printing, slides were processed followingpublished protocols. See DeResi et al., (1997).

Microarrays were images on a prototype multi-frame CCD camera indevelopment at Applied Precision, Inc. (Seattle, Wash.). Each CCD imageframe was approximately 2 mm square. Exposure time of 2 sec in the Cy5channel (white light through Chroma 618-648 nm excitation filter, Chroma657-727 nm emission filter) and 1 sec in the Cy3 channel (Chroma 535-560nm excitation filter, Chroma 570-620 nm emission filter) were doneconsecutively in each fram before moving to the next, spatiallycontiguous frame. Color isolation between the Cy3 and Cy5 channels was−100:1 or better. Frames were knitted together in software to make thecomplete images. The intensity of spots (−100 μm) were quantified fromthe 10 μm pixels by frame background subtraction and intensity averagingin each channel. Dynamic range of the resulting spot intensities wastypically a ration of 1000 between the brightest spots and thebackground-subracted additive error level. Normalization between thechannels was accomplished by normalizing each channel to the meanintensities of all genes. This procedure is nearly equivalent tonormalization between channels using the intensity ration of genomic DNAspots (See DeRisi et al., 1997), but is possibly more robust since it isbased on the intensities of several thousand spots distributed over thearray.

Determination of Signature Correlation Coefficients and their ConfidenceLimits

Correlation coefficients between the signature ORFs of variousexperiments were calculated using$\rho = {\sum\limits_{k}{x_{k}{y_{k}/\left( {\sum\limits_{k}{x_{k}^{2}{\sum\limits_{k}y_{k}^{2}}}} \right)^{1/2}}}}$where X_(k) is the log₁₀ of the expression ratio for the k'th gene inthe x signature, and y_(k) is the log₁₀ of the expression ratio for thek'th gene in the y signature. The summation is over those genes thatwere either up- or down-regulated in either experiment at the 95%confidence level. These genes each had a less than 5% chance of beingactually unregulated (having expression ratios departing from unity dueto measurement errors alone). This confidence level was assigned basedon an error model which assigns a lognormal probability distribution toeach gene's expression ratio with characteristic width based on theobserved scatter in its repeated measurements (repeated arrays at thesame nominal experimental conditions) and on the individual arrayhybridization quality. This latter dependence was derived from controlexperiments in which both Cy3 and Cy5 samples were derived from the sameRNA sample. For large numbers of repeated measurements the error reducesto the observed scatter. For a single measurement the error is based onthe array quality and the spot intensity.

Random measurement errors in the x and y signatures tend to bias thecorrelation toward zero. In most experiments the great majority of genesis not significantly affected but do exhibit small random measurementerrors. Selecting only the 95% confidence genes for the correlationcalculation, rather than the entire genome, reduces this bias and makesthe actual biological correlations more apparent.

Correlations between a profile and itself are unity by definition. Errorlimits on the correlation are 95% confidence limits based on theindividual measurement error bars, and assuming uncorrelated errors.They do not include the bias mentioned above; thus, a departure of ρfrom unity does not necessarily mean that the underlying biologicalcorrelation is imperfect. However, a correlation of 0.7±0.1, forexample, is very significantly different from zero. Small (magnitude ofρ<0.2) but formally significant correlation in the tables and textprobably are due to small systematic biases in the Cy5/Cy3 ratios whichviolate the assumption of independent measurement errors used togenerate the 95% confidence limits. Therefore, these small correlationvalues should be treated as not significant. A likely source ofuncorrected systematic bias is the partially corrected scanner detectornonlinearity that differentially affects the Cy3 and Cy5 detectionchannels.

The 1 μg/ml FK506 treatment signature was compared to over 40 unrelateddeletion mutant or drug signatures. These control profiles hadcorrelation coefficients with the FK506 profile which were distributedaround zero (mean ρ=−0.03) with a standard deviation of 0.16 (data notshown) and none had correlations greater than ρ=0.38. Similarly, thecalcineurin mutant signature correlated well with the CsA-treatmentsignature (ρ=0.71±0.04) but not with the signatures from the negativecontrol signatures (mean p=−0.02 with a standard deviation of 0.18).

Quality controls

End-to-end checks on expression ratio measurement accuracy were providedby analyzing the variance in repeated hybridizations using the same mRNAlabeled with both Cy3 and Cy5, and also using Cy3 and Cy5 mRNA samplesisolated from independent cultures of the same nominal strain andconditions. Biases undetected with this procedure, such as gene-specificbiases presumably due to differential incorporation of Cy3- and Cy5-dUTPinto cDNA, were minimized by performing hybridizations influorophore-reversed pairs, in which the Cy3/Cy5 labeling of thebiological conditions was reversed in one experiment with respect to theother. The expression ratio for each gene is then the ratio of ratiosbetween the two experiments in the pair. Other biases are removed byalgorithmic numerical detrending. The magnitude of these biases in theabsence of detrending and fluorophore reversal is typically on the/orderof 30% in the ratio, but may be as high as twofold for some ORFs.

Expression ratios are based on mean intensities over each spot. Theoccasional smaller spots have fewer image pixels in the average. Thisdoes not degrade accuracy noticeably until the number of pixels fallsbelow ten, in which case the spot is rejected from the data set. Wanderof spot positions with respect to the nominal grid is adaptively trackedin array subregions by the image processing software. Unequal spotwander within a subregion greater than half a spot spacing isproblematic for the automated quantitating algorithms; in this case thespot is rejected from analysis based on human inspection of the wander.Any spots partially overlapping are excluded from the data set. Lessthan 1% of spots typically are rejected for these reasons.

7 REFERENCES CITED

All references cited herein are incorporated herein by reference intheir entirety and for all purposes to the same extent as if eachindividual publication or patent or patent application was specificallyand individually indicated to be incorporated by reference in itsentirety for all purposes.

Many modifications and variations of this invention can be made withoutdeparting from its spirit and scope, as will be apparent to thoseskilled in the art. The specific embodiments described herein areoffered by way of example only, and the invention is to be limited onlyby the terms of the appended claims, along with the full scope ofequivalents to which such claims are entitled.

1. A method of fluorophore bias removal comprising the steps of: (a)labeling a first pool of genetic matter, derived from a biologicalsystem representing a baseline state, with a first fluorophore to obtaina first pool of fluorophore-labeled genetic matter; (b) labeling asecond pool of genetic matter, derived from a biological systemrepresenting a perturbed state, with a second fluorophore to obtain asecond pool of fluorophore-labeled genetic matter; (c) labeling a thirdpool of genetic matter, derived from said biological system representingsaid baseline state, with said second fluorophore to obtain a third poolof fluorophore-labeled genetic matter; (d) labeling a fourth pool ofgenetic matter, derived from said biological system representing saidperturbed state, with said first fluorophore to obtain a fourth pool offluorophore-labeled genetic matter; (e) contacting said first pool offluorophore-labeled genetic matter and said second pool offluorophore-labeled genetic matter with a first microarray underconditions such that hybridization can occur, respectively detecting afirst and second flourescent emission signal at each of a plurality ofdiscrete loci on the first microarray from said first and second pool offluorophore-labeled genetic matter that is bound to said firstmicroarray under said conditions, and determining a first color ratiobetween said first flourescent emission signal and said secondflourescent emission signal; (f) contacting said third pool offluorophore-labeled genetic matter and said fourth pool offluorophore-labeled genetic matter with a second microarray underconditions such that hybridization can occur, respectively detecting athird and fourth flourescent emission signal at each of a plurality ofdiscrete loci on the second microarray from said third and fourth poolof fluorophore-labeled genetic matter that is bound to said secondmicroarray under said conditions, and determining a second color ratiobetween said third flourescent emission signal and said fourthflourescent emission signal; and (g) computing an average color ratio byaveraging said first color ratio and said second color ratio; wherein:said first microarray and said second microarray are similar to eachother, exact replicas of each other, or are identical; and said firstcolor ratio and said second color ratio are determined for the sameindividual component of genetic matter, thereby removing fluorophorebias. 2-42. (canceled)
 43. A method of estimating error of a signalmeasured from a probe spot in a microarray experiment, the methodcomprising: (A) accounting for an additive error in said probe spot; and(B) accounting for a multiplicative error in said probe spot, whereinsaid accounting for said additive error (A) is determined by a variancein a first measure of intensity of said probe spot and a variance in asecond measure of intensity of said probe spot; and said accounting forsaid multiplicative error (B) is determined by said first measure ofintensity of said probe spot and said second measure of intensity ofsaid probe spot.
 44. The method of claim 43 wherein said first measureof intensity corresponds to an intensity of a first fluorophore at saidprobe spot in a first microarray and wherein said second measure ofintensity corresponds to an intensity of a second fluorophore at saidprobe spot in said first microarray.
 45. The method of claim 43 whereinsaid first measure of intensity corresponds to an intensity of afluorophore at said probe spot in a first microarray and wherein saidsecond measure of intensity corresponds to an intensity of saidfluorophore at said probe spot in a second microarray, wherein saidprobe spot in said first microarray corresponds to said probe spot insaid second microarray.
 46. The method of claim 45 wherein said firstmicroarray represents abundance levels of cellular constituents of abiological system when said biological system is in a base line state,and said second microarray represents abundance levels of cellularconstituents of the biological system when said biological system is ina perturbed state.
 47. The method of claim 46 wherein said biologicalsystem is a cell line, a cell culture, or a tissue sample obtained froma subject.
 48. The method of claim 46 wherein said biological system isa Homo sapien.
 49. The method of claim 46 wherein said perturbed stateis achieved by exposing said biological system to a drug candidate or apharmacological agent.
 50. The method of claim 46 wherein said perturbedstate is achieved by an introduction of an exogenous gene into thebiological system or by a deletion of a gene from the biological system.51. The method of claim 46 wherein said perturbed state is achieved by achange in a culture condition of the biological system.
 52. The methodof claim 43, wherein said additive error comprises:σ_(X) ²+σ_(Y) ² wherein σ_(X) ² is said variance in said first measureof intensity; and σ_(Y) ² is said variance in said second measure ofintensity.
 53. The method of claim 43, wherein said multiplicative errorcomprises:f²(X²+Y²) wherein X is the first measure of intensity of said probespot; Y is the second measure of intensity of said probe spot; and f isa fractional multiplicative error.
 54. The method of claim 53, themethod further comprising: deriving a value for said fractionalmultiplicative error f by fitting:{square root}{square root over (σ_(X) ²+σ_(Y) ²+f²(X²+Y²))} to aplurality of probe spots on a microarray, said plurality of probe spotscomprising said probe spot, wherein σ_(X) ² is said variance in saidfirst measure of intensity; and σ_(Y) ² is said variance in said secondmeasure of intensity.
 55. A computer program product for use inconjunction with a computer system, the computer program productcomprising a computer readable storage medium and a computer programmechanism embedded therein, wherein the computer program mechanism isfor estimating error of a signal measured from a probe spot in amicroarray experiment, the computer program mechanism comprising:instructions for accounting for an additive error in said probe spot;and instructions for accounting for a multiplicative error in said probespot, wherein said accounting for said additive error is determined by avariance in a first measure of intensity of said probe spot and avariance in a second measure of intensity of said probe spot; and saidaccounting for said multiplicative error is determined by said firstmeasure of intensity of said probe spot and said second measure ofintensity of said probe spot.
 56. The computer program product claim 55wherein said first measure of intensity corresponds to an intensity of afirst fluorophore at said probe spot in a first microarray and whereinsaid second measure of intensity corresponds to an intensity of a secondfluorophore at said probe spot in said first microarray.
 57. Thecomputer program product of claim 55 wherein said first measure ofintensity corresponds to an intensity of a fluorophore at said probespot in a first microarray and wherein said second measure of intensitycorresponds to an intensity of said fluorophore at said probe spot in asecond microarray, wherein said probe spot in said first microarraycorresponds to said probe spot in said second microarray.
 58. Thecomputer program product of claim 57 wherein said first microarrayrepresents abundance levels of cellular constituents of a biologicalsystem when said biological system is in a base line state, and saidsecond microarray represents abundance levels of cellular constituentsof the biological system when said biological system is in a perturbedstate.
 59. The computer program product of claim 58 wherein saidbiological system is a cell line, a cell culture, or a tissue sampleobtained from a subject.
 60. The computer program product of claim 58wherein said biological system is a Homo sapien.
 61. The computerprogram product of claim 58 wherein said perturbed state is achieved byexposing said biological system to a drug candidate or a pharmacologicalagent.
 62. The computer program product of claim 58 wherein saidperturbed state is achieved by an introduction of an exogenous gene intothe biological system or by a deletion of a gene from the biologicalsystem.
 63. The computer program product of claim 58 wherein saidperturbed state is achieved by a change in a culture condition of thebiological system.
 64. The computer program product of claim 55, whereinsaid additive error comprises:σ_(X) ²+σ_(Y) ² wherein σ_(X) ² is said variance in said first measureof intensity; and σ_(Y) ² is said variance in said second measure ofintensity.
 65. The computer program product of claim 55, wherein saidmultiplicative error comprises:f²(X²+Y²) wherein X is the first measure of intensity of said probespot; Y is the second measure of intensity of said probe spot; and f isa fractional multiplicative error.
 66. The computer program product ofclaim 65, the computer program mechanism further comprising:instructions for deriving a value for said fractional multiplicativeerror f by fitting:{square root}{square root over (σ_(X) ²+σ_(Y) ²+f²(X²+Y²))} to aplurality of probe spots on a microarray, said plurality of probe spotscomprising said probe spot, wherein σ_(X) ² is said variance in saidfirst measure of intensity; and σ_(Y) ² is said variance in said secondmeasure of intensity.
 67. A computer system for estimating error of asignal measured from a probe spot in a microarray experiment, thecomputer system comprising a processor, and a memory encoding one ormore programs coupled to the processor, wherein the one or more programscause the processor to perform a method comprising: accounting for anadditive error in said probe spot; and accounting for a multiplicativeerror in said probe spot, wherein said accounting for said additiveerror is determined by a variance in a first measure of intensity ofsaid probe spot and a variance in a second measure of intensity of saidprobe spot; and said accounting for said multiplicative error isdetermined by said first measure of intensity of said probe spot andsaid second measure of intensity of said probe spot.
 68. A method ofestimating error of a signal from a probe spot in a two-fluorophoremicroarray experiment, the method comprising: empirically fitting avalue f such that the expression:{square root}{square root over (σ_(X) _(i) ²+σ_(Y) _(i) ²+f²(X_(i)²+Y_(i) ²))} fits an x-y plot for a plurality of probe spots in saidtwo-fluorophore microarray experiment, wherein said plurality of probespots comprises said probe spot, and wherein a first axis of said x-yplot represents, for each respective probe spot in said plurality ofprobe spots, the mean of (i) a measure of intensity of a firstfluorophore for said respective probe spot, and (ii) a measure ofintensity of a second fluorophore for said respective probe spot; asecond axis of said x-y plot represents a ratio, for each respectiveprobe spot in said plurality of probe spots, between (iii) a measure ofintensity of said first fluorophore for said respective probe spot and(iv) a measure of intensity of said second fluorophore for saidrespective probe spot; X_(i) ² is a measure of intensity of said firstfluorophore in an i^(th) probe spot in said plurality of probe spots;Y_(i) ² is a measure of intensity of said second fluorophore in saidi^(th) probe spot; σ_(X) _(i) ² is a variance in X_(i) ² that representsan additive error of X_(i) ²; and σ_(Y) _(i) ² is a variance in Y_(i) ²that represents an additive error of Y_(i) ².
 69. The method of claim 68wherein X_(i) ² is measured from said probe spot in a first microarrayand wherein Y_(i) ² is measured from said probe spot in said firstmicroarray.
 70. The method of claim 68 wherein X_(i) ² is measured fromsaid probe spot in a first microarray and wherein Y_(i) ² is measuredfrom said probe spot in a second microarray, wherein said probe spot insaid first microarray corresponds to said probe spot in said secondmicroarray.
 71. The method of claim 70 wherein said first microarrayrepresents abundance levels of cellular constituents of a biologicalsystem when said biological system is in a base line state, and saidsecond microarray represents abundance levels of cellular constituentsof the biological system when said biological system is in a perturbedstate.
 72. The method of claim 71 wherein said biological system is acell line, a cell culture, or a tissue sample obtained from a subject.73. The method of claim 71 wherein said biological system is a Homosapien.
 74. The method of claim 71 wherein said perturbed state isachieved by exposing said biological system to a drug candidate or apharmacological agent.
 75. The method of claim 71 wherein said perturbedstate is achieved by an introduction of an exogenous gene into thebiological system or by a deletion of a gene from the biological system.76. The method of claim 71 wherein said perturbed state is achieved by achange in a culture condition of the biological system.
 77. A computerprogram product for use in conjunction with a computer system, thecomputer program product comprising a computer readable storage mediumand a computer program mechanism embedded therein, wherein the computerprogram mechanism is for estimating error of a probe spot in atwo-fluorophore microarray experiment, the computer program mechanismcomprising instructions for empirically fitting a value f such that theexpression:{square root}{square root over (σ_(X) _(i) ²+σ_(Y) _(i) ²+f²(X_(i)²+Y_(i) ²))} fits an x-y plot for a plurality of probe spots in saidtwo-fluorophore microarray experiment, wherein said plurality of probespots comprises said probe spot, and wherein a first axis of said x-yplot represents, for each respective probe spot in said plurality ofprobe spots, the mean of (i) a measure of intensity of a firstfluorophore for said respective probe spot, and (ii) a measure ofintensity of a second fluorophore for said respective probe spot; asecond axis of said x-y plot represents a ratio, for each respectiveprobe spot in said plurality of probe spots, between (iii) a measure ofintensity of said first fluorophore for said respective probe spot and(iv) a measure of intensity of said second fluorophore for saidrespective probe spot; X_(i) ² is a measure of intensity of said firstfluorophore in an i^(th) probe spot in said plurality of probe spots;Y_(i) ² is a measure of intensity of said second fluorophore in saidi^(th) probe spot; σ_(X) _(i) ² is a variance in X_(i) ² that representsan additive error of X_(i) ²; and σ_(Y) _(i) ² is a variance in Y_(i) ²that represents an additive error of Y_(i) ².
 78. A computer system forestimating error of a probe spot in a two-fluorophore microarrayexperiment, the computer system comprising a processor, and a memoryencoding one or more programs coupled to the processor, wherein the oneor more programs cause the processor to empirically fit a value f suchthat the expression:{square root}{square root over (σ_(X) _(i) ²+σ_(Y) _(i) ²+f²(X_(i)²+Y_(i) ²))} fits an x-y plot for a plurality of probe spots in saidtwo-fluorophore microarray experiment, wherein said plurality of probespots comprises said probe spot, and wherein a first axis of said x-yplot represents, for each respective probe spot in said plurality ofprobe spots, the mean of (i) a measure of intensity of a firstfluorophore for said respective probe spot, and (ii) a measure ofintensity of a second fluorophore for said respective probe spot; asecond axis of said x-y plot represents a ratio, for each respectiveprobe spot in said plurality of probe spots, between (iii) a measure ofintensity of said first fluorophore for said respective probe spot and(iv) a measure of intensity of said second fluorophore for saidrespective probe spot; X_(i) ² is a measure of intensity of said firstfluorophore in an i^(th) probe spot in said plurality of probe spots;Y_(i) ² is a measure of intensity of said second fluorophore in saidi^(th) probe spot; σ_(X) _(i) ² is a variance in X_(i) ² that representsan additive error of X_(i) ²; and σ_(Y) _(i) ² is a variance in Y_(i) ²that represents an additive error of Y_(i) ².